PAC  LEARNING  A DECISION  TREE  WITH  PRUNING 


By 

HYUNSOO  KIM 


A DISSERTATION  PRESENTED  TO  THE  GRADUATE  SCHOOL 
OF  THE  UNIVERSITY  OF  FLORIDA  IN  PARTIAL  FULFILLMENT 
OF  THE  REQUIREMENTS  FOR  THE  DEGREE  OF 
DOCTOR  OF  PHILOSOPHY 


UNIVERSITY  OF  FLORIDA 
1992 




Copyright  1992 
by 

Hyunsoo  Kim 


To my parents, my sisters and brother,
and  my  wife  who  made  this  possible 


ACKNOWLEDGEMENTS 


First  of  all,  I would  like  to  express  my  sincere 
appreciation  to  Dr.  Gary  J.  Koehler,  chairman  of  my 
supervisory  committee  and  my  department.  My  research  here  was 
guided  by  his  invaluable  advice  and  continuous  encouragement. 
I am also deeply grateful for his kindness in every aspect of
my  life  in  the  United  States. 

I am  grateful  to  Dr.  Harold  P.  Benson  for  his  ever  clear 
and  organized  teaching  in  courses  which  trained  me  for 
theoretical  research. 

I am  indebted  to  Dr.  Selcuk  Erenguc  for  his  considerate 
guidance  and  sincere  help  in  my  graduate  studies. 

I also would like to thank Dr. William Messier for his
kind  advice  on  my  research,  and  to  Dr.  Mark  Pendergast  for 
serving  on  this  committee. 

Finally  I would  like  to  thank  all  my  family  members  for 
their  support  and  encouragement.  My  special  thanks  go  to  my 
wife,  Yeongsil,  who  is  always  with  me  and  thinks  and  feels  for 
me more than I do. My final thanks are directed to my
daughter, Seoyoung, who always gives me a special motivation
for  study. 




TABLE OF CONTENTS

ACKNOWLEDGMENTS

ABSTRACT

CHAPTERS

1 INTRODUCTION
    1.1 Motivation of Research
    1.2 Problem Statement

2 THEORETICAL BACKGROUND AND RELATED RESEARCH
    2.1 Learning Theory
    2.2 Decision Tree Induction Methods
    2.3 Focus of Research

3 PAC LEARNING A DECISION TREE WITH PRUNING
    3.1 Introduction
    3.2 Binary Decision Trees
    3.3 Pruning a Consistent Decision Tree to a Desired Rank
    3.4 Determining the Pruning Error
    3.5 Sample Size Sufficient for PAC Identification
    3.6 Other Pruning Rules
    3.7 Chapter Summary

4 THE ACCURACY OF A PRUNED DECISION TREE
    4.1 Introduction
    4.2 Estimations of Error of a Pruned Decision Tree
    4.3 An Application

5 AN INVESTIGATION ON THE CONDITIONS OF PRUNING
    5.1 Introduction
    5.2 Fundamental Situation of Pruning
    5.3 A Bayesian Analysis on the Conditions Where Pruning is Useful
    5.4 A Generalization on the Conditions Where Pruning is Useful

6 SUMMARY AND FUTURE RESEARCH

REFERENCES

BIOGRAPHICAL SKETCH


Abstract  of  Dissertation  Presented  to  the  Graduate  School 
of  the  University  of  Florida  in  Partial  Fulfillment  of  the 
Requirements  for  the  Degree  of  Doctor  of  Philosophy 

PAC  LEARNING  A DECISION  TREE  WITH  PRUNING 

By 

HYUNSOO  KIM 
May  1992 

Chairman:  Dr.  Gary  J.  Koehler 

Major  Department:  Decision  and  Information  Sciences 

This  dissertation  investigates  various  theoretical 
effects  of  pruning  a decision  tree.  Empirical  results  have 
shown  that  pruning  can  improve  the  accuracy  of  an  induced 
decision  tree.  Pruning  also  leads  to  more  concise  rules. 
First  we  provide  a pruning  algorithm  based  on  the  rank  of  a 
decision  tree.  A bound  on  the  error  due  to  pruning  by  the 
rank  of  a decision  tree  is  determined  under  the  assumptions  of 
an  equally  likely  distribution  over  the  instance  space  and  a 
deterministic  tree  labelling  rule.  This  bound  is  then  used 
with  recent  results  in  learning  theory  to  determine  a sample 
size  sufficient  for  Probably  Approximately  Correct  (PAC) 
identification  of  decision  trees  with  pruning.  We  also 
discuss  other  pruning  rules  and  their  effects  on  the  error  due 
to  pruning.  With  a nondeterministic  tree  labelling  rule,  we 
show that the upper bound of the average pruning error is less
than  or  equal  to  0.5  under  an  equally  likely  distribution. 




In  a realistic  learning  environment  it  is  often  not 
possible  to  obtain  a large  enough  sample  to  guarantee  PAC 
learning.  For  those  cases,  we  provide  several  methods  for  a 
posterior  evaluation  of  the  accuracy  of  a pruned  decision 
tree.  We  give  a method  which  estimates  a lower  bound  for  the 
worst possible confidence factor, δ, by using a Beta prior.
Also,  we  give  a more  detailed  view  of  the  meaning  of  this 
lower  bound,  and  suggest  a way  to  improve  this  lower  bound. 

Finally  we  develop  conditions  under  which  pruning  is 
necessary  for  better  prediction  accuracy  as  well  as  for 
concept  simplification.  We  give  an  analysis  of  the  reason  why 
pruning  is  necessary  in  realistic  learning  situations. 

We  generalize  a previous  result  for  larger  training  sets. 

A Bayesian  analysis  shows  that  the  average  prediction  accuracy 
of  the  pruned  tree  increases,  and  the  effect  of  description 
noise  becomes  stronger  as  the  size  of  the  training  set 
increases.  For  very  large  training  sets,  the  pruned  tree  has 
the  prediction  accuracy  equal  to  that  of  the  unpruned  tree. 




CHAPTER  1 
INTRODUCTION 

1.1  Motivation  of  Research 

Learning is a general term denoting the way in which
people or computers increase their knowledge or improve their
skills (Cohen and Feigenbaum 1982). There are several views
of learning. One comprehensive view is that "learning denotes
changes in the system that are adaptive in the sense that they
enable the system to do the same task or tasks drawn from the
same population more efficiently and more effectively the next
time" (Simon 1983). Many expert systems treat learning as the
acquisition of explicit knowledge. Other views hold that
learning is skill acquisition, or that learning is theory
formation or discovery.

Human learning is a long, slow process. Overcoming such
inefficiencies of human learning provides the primary motivation
for research in machine learning (Simon 1983). There are more
compelling reasons for machine learning in the business
environment. Expert systems are indispensable to modern
business. Learning has been an important and effective method
of acquiring knowledge for expert systems. Knowledge
acquisition is the major bottleneck in expert systems
development since it is difficult to elicit an expert's
knowledge by reconstruction methods such as interviewing an
expert (Musen 1989). The more an expert knows, the less able
he is to articulate that knowledge. This phenomenon is known
as "the paradox of expertise" (Johnson 1983). Since experts
often find it hard to articulate their expertise, many
researchers are trying to develop alternative knowledge
acquisition methods such as machine learning.

As we develop efficient automatic reasoning procedures
that are applicable to knowledge acquisition, the knowledge
acquisition bottleneck of expert systems may be relieved and
expert systems will become more practical.

Expert  systems  have  been  used  in  business  applications  as 
diverse  as  production,  marketing,  finance,  accounting, 
personnel  and  strategic  management.  Several  benefits  of 
expert systems are as follows (Parker 1989):

• They preserve knowledge that might be lost through
retirement, resignation or death of an acknowledged company
expert.

• They put information into an active form, so it can be
summoned almost as one might summon a real-life expert.

• They can assist novices to think like experienced
professionals.

• They are not subject to such human failures as fatigue,
being too busy or being subject to emotions.

These benefits can lead to lower costs, better service,
higher sales, and possibly, significant competitive
advantages.

As  expert  systems  are  used  in  everyday  business,  our 
business  environment  will  have  a wholly  different  shape  with 
flexible  labor  force  utilization  and  increased  productivity. 

There  is  also  a down-side.  The  office  of  the  future  may 
become the factory of the past (Garson 1988). As tasks that
require  creativity  and  expertise  are  performed  by  the 
computerized  system  in  a highly  automated  and  mechanized  way, 
many  professional  people  may  be  de-skilled  and  lose  the  joy 
they  feel  in  planning  and  carrying  out  tasks. 

1.2  Problem  Statement 

Cohen  and  Feigenbaum  (1982)  divide  the  topic  of  learning 
into  four  areas:  rote  learning,  learning  by  being  told  (advice 
taking),  learning  from  examples  (induction),  and  learning  by 
analogy.  A brief  description  of  the  four  learning  situations 
follows.  Suppose  that  a learning  system  is  embedded  in  an 
environment  of  interest  and  a knowledge  base  is  used  by  the 
performance  element.  Here  the  knowledge  base  is  a collection 
of  knowledge  and  the  performance  element  is  a system 
performing  tasks  by  using  the  knowledge  base. 

1. Rote learning: The environment supplies knowledge in
a form that can be used directly by the performance element.
The learning system just needs to memorize the knowledge for
later use. Though rote learning is a rudimentary type of
learning, it is widely used in daily human life.

2. Learning by being told: Here the environment gives
vague, general purpose knowledge or advice. A learning system
must transform this high-level knowledge into a form that can
be used readily by the performance element. Davis's (1982)
TEIRESIAS is an example of this type of learning.

3. Learning from examples: In learning from examples,
examples are given to the learning system. The system
generalizes these examples to find higher level rules that can
be used by the performance element. This type of learning has
a long history under the name of "induction" and is a powerful
method of acquiring knowledge.

4. Learning by analogy: If a system has available to it
a knowledge base for a related performance task, it may be
able to improve its own performance by recognizing analogies
and transferring the relevant knowledge from the other
knowledge base.

Some researchers add further types of learning to these
four. For instance, Shaw et al. (1990) list two more:
learning by competition, and learning from observation and
discovery.

Learning situations can also be categorized by the setting
in which learning takes place. There are two typical settings:
"supervised" learning, where a teacher is present and can
tell anything about the training set, and "unsupervised"
learning, where the learning system has no instructor.

Inductive  learning  is  generally  considered  synonymous 
with  learning  from  examples.  However,  Angluin  and  Smith 
(1983)  distinguish  inductive  inference  (i.e.,  inductive 
learning)  from  learning  from  examples.  They  say  that  work  in 
artificial  intelligence  (i.e.,  learning  from  examples)  is  more 
concerned  with  cognitive  modeling  than  the  work  in  inductive 
inference,  and  less  concerned  with  formal  properties  such  as 
convergence  in  the  limit  or  computational  efficiency.  A 
learning  algorithm  is  said  to  learn  a concept  in  the  limit  if, 
after  some  finite  number  of  examples,  the  learner's  hypothesis 
is  correct,  and  thereafter  all  the  learner's  hypotheses  remain 
correct (Laird 1987). Convergence in the limit does capture
the  notion  that  the  learner  will  eventually  discard  any  false 
hypotheses,  and  that  in  finite  time  this  progression  will 
ultimately  converge  to  a fixed,  correct  rule.  The 
computational  complexity  of  a learning  algorithm  is  defined  if 
and only if the algorithm converges (Angluin and Smith 1983).
Computational  efficiency  can  be  considered  as  an  additional 
important  property  of  an  inductive  learning  algorithm  in 
practice.  The  construction  of  decision  trees  is  an  important 
type  of  inductive  learning  (Quinlan  1986;  Mingers  1989a; 
Utgoff 1989). A decision tree (consisting of nodes and
branches) represents a collection of rules, with each terminal
node corresponding to a specific decision rule. Decision
trees  are  constructed  beginning  with  the  root  of  the  tree  and 
proceeding  down  to  its  leaves.  This  approach  may  be  used 
directly  for  predictive  or  descriptive  purposes  (Braun  and 
Chandler  1987;  Carter  and  Catlett  1987;  Messier  and  Hansen 
1988;  Shaw  and  Gentry  1990)  or  may  be  applied  to  knowledge 
acquisition  for  expert  systems  (Quinlan  1979;  Michalski  and 
Chilausky 1980; Quinlan 1987b). Decision tree induction is
free  from  parametric  and  structural  assumptions  that  most 
statistical  methods,  such  as  discriminant  analysis,  are  based 
on.  Researchers  have  found  that  large  decision  trees 
constructed  from  a training  set  usually  do  not  retain  their 
accuracy  over  the  whole  instance  space  (Quinlan  1983;  Breiman 
et al. 1984; Spangler et al. 1989). Recently a number of
papers  have  investigated  pruning  large  decision  trees  built 
from  training  examples  (Niblett  1987;  Quinlan  1987a;  Fisher 
and  Schlimmer  1988;  Mingers  1989b). 

Many  of  the  branches  of  the  constructed  decision  trees 
will  reflect  chance  occurrences  in  the  particular  data  rather 
than representing true underlying relationships. Often, these
chance occurrences are very unlikely to recur in further
examples. Since these less
reliable  branches  can  be  removed  by  pruning,  the  pruned 
decision  tree  often  gives  better  classification  over  the  whole 
instance  space  even  though  it  may  have  a higher  error  over  the 
training  set. 




While  these  papers  have  reported  excellent  empirical 
results  for  pruning  in  terms  of  improving  the  accuracy  of  the 
learned  concepts,  those  results  depend  heavily  on  the  specific 
training  set  and  the  domains  over  which  they  apply.  Whether 
these  results  hold  in  general  and  to  what  extent  pruning 
improves  a concept  are  unknown. 

In  this  study  we  investigate  various  theoretical  aspects 
of  pruning  a decision  tree.  We  first  review  learning  theory 
and  previous  research  on  decision  tree  induction  and  pruning. 
Then  we  determine  the  pruning  error  in  a typical  situation  and 
determine the number of examples required to assure a desired
confidence  level.  We  also  estimate  the  accuracy  of  a pruned 
decision  tree  for  the  learning  situations  where  acquiring 
enough  examples  is  difficult  or  even  impossible.  Finally  we 
investigate  conditions  that  give  rise  to  increased  accuracy 
due to pruning. (Each term will be rigorously defined later.)


CHAPTER  2 

THEORETICAL  BACKGROUND  AND  RELATED  RESEARCH 
2.1  Learning  Theory 

Since the early 1980s, learning theory has developed
rapidly. Mitchell's version space (Mitchell 1982),
Valiant's PAC (Probably Approximately Correct) learning
(Valiant 1984; Angluin and Laird 1988), and Haussler's
work (Haussler 1988, 1990) are important building blocks
of the theory. In general, learning theory
considers three aspects of the learning process: concept
accuracy, storage efficiency, and computational efficiency.
In this section we summarize mainstream learning theory in
terms of these three considerations.

2.1.1  Mitchell's  Version  Space 

Mitchell  (1982)  gives  an  elegant  framework  for  viewing 
the  process  of  learning  from  examples  and  illustrates  this 
framework  by  analyzing  the  process  of  learning  simple 
conjunctive  concepts.  We  start  with  basic  definitions  and 
terminology  used  in  his  framework. 



Definition 2.0:

1. Instance space: The instance space is the space of all
objects of interest. Each instance of a concept can be
expressed in a feature-based or attribute-based form.
Attribute-based instance spaces are defined by the values
of a set of attributes, not all of which are necessarily
relevant. Feature (or structure)-based instance spaces are
defined by allowing each instance to include several objects,
each with its own attributes, and allowing binary relations
that define a structure between objects.

For example, define an instance space consisting of the
following attributes and binary relations (Haussler 1989).
Permissible values appear in parentheses.

Attributes:
    • size (small, medium, large)
    • shape (convex, nonconvex)

Binary relations:
    • distance-between (touching, nontouching)
    • relative-position (on-top-of, under)

Then (size=small, shape=convex) is an example of an instance
of an attribute-based instance space. The following instance
is an example of a structure-based instance space.

(size=small,  shape=convex) 

(under,  touching)  ! 1 (on-top-of,  touching) 

(size=large,  shape=nonconvex) 

2. Hypothesis space: The hypothesis space is the space of all
plausible hypotheses. It is often called the rule space.

3. Inductive bias: A mechanism whereby the space of hypotheses
is restricted or whereby some hypotheses are preferred, a
priori, over others reflects the inductive bias.

4. Conjunctive concepts: Concepts described by logical
expressions involving only conjunctions (i.e., AND operations)
are called conjunctive concepts.

5. Disjunctive concepts: Concepts described by logical
expressions involving only disjunctions (i.e., inclusive OR
operations) are called disjunctive concepts.

6. Target concept: The target concept is the true concept. A
learning system tries to find the target concept.

Mitchell's  framework  of  learning  can  be  described  as 
follows.  Let  us  assume  that  we  are  trying  to  learn  some 
unknown  target  concept,  f,  defined  on  the  instance  space. 
This  target  concept  can  be  any  subset  of  the  instance  space. 
This  may  or  may  not  be  a conjunctive  concept.  Assume  we  have 
a set  of  examples  of  this  target  concept.  Each  example  is 
generated by "sampling with replacement", and is one of the
following two cases.

Case 1. An instance satisfying the target concept.

Case 2. An instance not satisfying the target concept.

An example in Case 1 is called a "positive example", and an
example in Case 2 is called a "negative example". We label
each example accordingly. We call these labelled examples a
sample, S, of the target concept.




We  assume  a hypothesis  space,  H,  restricted  to  only 
conjunctive  concepts.  This  is  an  inductive  bias.  Then  the 
task  is  to  produce  a conjunctive  concept  that  is  consistent 
with  the  sample  or  to  detect  when  no  conjunctive  concept  is 
consistent with the sample. By "consistent" we mean that
a concept contains every positive example in the sample and
none of the negative examples.

The version space is the set of all hypotheses h ∈ H that
are consistent with the sample. Since the version space
depends on the hypothesis space, we speak of the version space
with respect to the hypothesis space H. The version space is
empty when no hypothesis in H is consistent with
the sample.
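
As an illustration, the version space of a small sample can be computed by brute force. The following Python sketch assumes a toy attribute-based instance space with the size and shape attributes used in the example of Definition 2.0. The names ATTRIBUTES, ANY, all_conjunctions, covers and version_space are hypothetical and are introduced only for this illustration; the sketch is not Mitchell's candidate elimination procedure, which avoids enumerating H.

```python
from itertools import product

# Hypothetical toy attribute space (attribute names follow Definition 2.0).
ATTRIBUTES = {
    "size": ("small", "medium", "large"),
    "shape": ("convex", "nonconvex"),
}
ANY = "*"  # marks an attribute that the conjunction does not constrain


def all_conjunctions(attributes):
    """Enumerate every conjunctive hypothesis as a tuple of values or ANY."""
    choices = [values + (ANY,) for values in attributes.values()]
    return list(product(*choices))


def covers(hypothesis, instance):
    """True if the instance satisfies every constrained attribute."""
    return all(h == ANY or h == x for h, x in zip(hypothesis, instance))


def version_space(hypotheses, sample):
    """Hypotheses that cover every positive and no negative example."""
    return [
        h for h in hypotheses
        if all(covers(h, x) == label for x, label in sample)
    ]


H = all_conjunctions(ATTRIBUTES)
S = [
    (("small", "convex"), True),      # positive example
    (("large", "convex"), True),      # positive example
    (("small", "nonconvex"), False),  # negative example
]
VS = version_space(H, S)
print(len(H), VS)
```

For this sample only the hypothesis (*, convex) remains in the version space. Adding the negative example (medium, convex) would leave the version space empty, which signals that no conjunctive concept is consistent with the sample.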

Mitchell  shows  that  the  learning  task  (and  related  tasks) 
of  producing  a conjunctive  concept  consistent  with  the  sample 
can  be  solved  by  keeping  track  of  only  two  subsets  of  the 
version space: the set of the most specific hypotheses and the
set  of  the  most  general  hypotheses.  These  sets  are  updated 
accordingly  as  new  examples  are  given. 

Here we consider a finite instance space, and we assume
that no two examples contradict each other. There are two
cases for the target concept f.

Case 1: The target concept f ∈ H

Case 2: The target concept f ∉ H

For Case 1, the version space shrinks until it contains
only the target concept f, as examples are added to the
sample. For Case 2, the version space shrinks until it
becomes the empty set. Note that in both cases the version
space reduces to an informative terminal state that reveals
the result of the learning task. If we stop before one of the
terminal states is reached, then the learning task is
incomplete and the current result is less useful.

We  say  that  the  version  space  is  exhausted  with  respect 
to  H (we  abbreviate  this  as  "w.r.t.  H")  if  the  version  space 
is  reduced  to  one  of  the  above  terminal  states  (i.e.,  is 
reduced to either the target concept f or the empty set).

Consider  the  situation  that  the  version  space  contains 
only  one  hypothesis  h,  where  h is  not  the  target  concept.  If 
the  target  concept  f is  not  an  element  of  the  hypothesis  space 
H,  then  this  situation  may  occur.  For  this  situation,  we 
assume  that  it  is  always  possible  to  generate  a new  example 
which eliminates h from the version space. The version space
then becomes empty and is therefore exhausted.

Mitchell's  approach  to  inductive  learning  is  to  sample 
until  the  version  space  is  exhausted.  Stopping  short  of  an 
exhausted  version  space  leaves  one  with  incomplete  learning. 
However,  the  two  subsets  of  version  space  bound  the  space 
where  the  target  concept  may  exist. 




2.1.2  Problems  of  Mitchell's  Framework 

There  are  two  practical  problems  with  Mitchell's 
approach.  The  first  is  that  it  may  require  too  many  examples 
to exhaust the version space (Haussler 1988).

The  other  problem  is  that  even  if  we  monitor  only  the  two 
sets,  the  set  of  the  most  specific  hypotheses  and  the  set  of 
the  most  general  hypotheses,  the  storage  needed  can  still 
become exponentially large as we build up examples (Haussler
1988).

These  problems  with  the  version  space  approach  are 
overcome by incorporating probabilistic ideas (Valiant 1984).
Here  we  will  not  require  the  complete  exhaustion  of  the 
version  space.  Instead,  we  will  require  that  a version  space 
is  "probably  almost  exhausted".  (This  term  will  be  formally 
defined  later.)  This  idea  will  do  away  with  the  first 
problem. 

To  handle  the  second  problem,  we  will  not  try  to  remember 
the  exact  version  space.  Instead,  we  will  require  that  any 
hypothesis  from  an  "almost  exhausted"  version  space  will 
accurately  approximate  the  target  concept. 

Hence, we replace Mitchell's idea of remembering all
consistent hypotheses by a more elegant idea: drawing enough
examples for the version space to be "probably almost
exhausted" and then finding a hypothesis (or hypotheses)
consistent with these examples.




The  idea  of  "almost  exhausted"  is  formalized  in 
Definition  2.1. 

Definition 2.1: Given a hypothesis space H, a target concept
f, a sequence of examples S of f, and an error tolerance ε,
where 0 < ε < 1, the version space of S (w.r.t. H) is
ε-exhausted (w.r.t. f) if it does not contain any hypothesis
that has error ("error" will be carefully defined later) more
than ε with respect to f.
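
The definition can also be written compactly. In the sketch below, the symbol VS_{H,S} for the version space is notation assumed here for illustration, and d(h,f) is the error of h with respect to f, which is defined in the next subsection.

```latex
VS_{H,S} = \{\, h \in H : h(x) = y \ \text{for all}\ (x, y) \in S \,\},
\qquad
VS_{H,S}\ \text{is $\epsilon$-exhausted w.r.t. } f
\iff
\forall\, h \in VS_{H,S} :\ d(h, f) \le \epsilon .
```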

2.1.3  Efficient  PAC  Learning 

Now  we  introduce  a formal  definition  of  Probably 
Approximately  Correct  (PAC)  learning  based  on  the  idea  of  an 
"almost  exhausted"  version  space  defined  in  the  previous 
subsection.  The  concept  of  computational  efficiency  is  also 
very  important  for  a learning  algorithm  to  be  practical  for 
larger  problems.  The  PAC  learning  paradigm  was  introduced  by 
Valiant in 1984. This model requires that a polynomially
bounded algorithm identify a concept using a random sample,
whose size is polynomially bounded, such that the learned
concept has a high probability of being close to the true
concept. Angluin and Laird (1988) coined the term PAC
learning. A more precise definition follows.

Let X be the instance space of interest. The target
concept f maps X into {0,1}. Similarly, for any other concept
h, we have

h: X → {0,1}.

The error, d(h,f), of a learned concept h is the probability
that an instance is incorrectly classified by h. That is,

d(h,f) = Prob{ x ∈ X : h(x) ≠ f(x) }.

Prob{ } is determined by an arbitrary sampling distribution, D,
over X. Learning is accomplished by running a learning
procedure on a sample of instances called the training sample.
Sampling is assumed to be with replacement, with examples drawn
independently.
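
When X is finite and D is known, the error d(h,f) can be computed exactly by summing the probability of the instances on which h and f disagree. The following Python sketch assumes a hypothetical ten-element instance space with a uniform distribution; the concepts f and h and the helper name error are illustrative only.

```python
from fractions import Fraction


def error(h, f, distribution):
    """d(h, f): total probability of the instances on which h and f disagree."""
    return sum(p for x, p in distribution.items() if h(x) != f(x))


# Hypothetical finite instance space X = {0, ..., 9} with a uniform D.
X = range(10)
D = {x: Fraction(1, 10) for x in X}


def f(x):
    """Target concept: instances below 5 are positive."""
    return int(x < 5)


def h(x):
    """Learned concept: instances below 7 are positive."""
    return int(x < 7)


# h and f disagree exactly on x = 5 and x = 6.
print(error(h, f, D))
```

Here the two concepts disagree on two of the ten equally likely instances, so d(h,f) = 1/5.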

For 0 < ε, δ < 1, a learning procedure is said to be a
probably approximately correct (with respect to D)
identification of the target concept f if the hypothesis h
it produces satisfies

Prob{ d(h,f) > ε } < δ.

We say that the above learning procedure is an efficient PAC
identifier if it is a polynomially bounded algorithm which
identifies a concept from a random sample whose size is
polynomially bounded.

As we see in the definition, the PAC learning model is
defined for {0,1}-valued functions. Haussler (1990)
gives a generalization of the PAC learning model that is based
on statistical decision theory and can be applied to
multi-valued discrete functions, real-valued functions and
vector-valued functions. In this generalized model the learning
system receives randomly drawn examples, each example
consisting of an instance x ∈ X and an outcome y ∈ Y. The
learning system finds a hypothesis

h: X → A

that specifies the appropriate action a ∈ A to take for each
instance x, in order to minimize the expectation of a loss
function L(y,a). Here X, Y and A are arbitrary sets, L is a
real-valued function, and examples are generated according to
an arbitrary joint distribution on X × Y.
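
The quantity being minimized can be written out as follows. The symbols r(h) for the expected loss and P for the joint distribution are notation assumed here for illustration; they are not taken from the text.

```latex
r(h) = \mathbf{E}_{(x, y) \sim P}\bigl[\, L\bigl(y, h(x)\bigr) \bigr],
\qquad h : X \to A, \quad L : Y \times A \to \mathbb{R}.
```

Under these assumptions the {0,1}-valued model is recovered by taking Y = A = {0,1}, letting y = f(x), and setting L(y,a) = 1 if a ≠ y and 0 otherwise, in which case r(h) = d(h,f).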

2.1.4  The  Performance  of  a Learning  Algorithm 

As  we  have  seen  in  the  definition  of  PAC  learning,  two 
measures  of  learning  performance  are  relevant.  The  first  is 
sample  complexity,  and  the  second  is  computational  complexity. 

1)  Sample  Complexity:  The  sample  complexity  is  the  number 
of  random  examples  needed  to  produce  a hypothesis  that  with 
high  probability  has  small  error.  It  is  defined  by  taking  the 
number  of  random  examples  needed  in  the  worst  case  over  all 
the  target  concepts  in  the  class  and  all  the  probability 
distributions  on  the  instance  space. 

2)  Computational  Complexity:  The  computational  complexity 
is the worst case computation time to produce a hypothesis
from a sample of a given size.

We use big-O notation to denote both complexities.

Vapnik (1982) was the first to give a characterization of
the sample complexity of a learning algorithm. Lemma
2.2 below gives a sufficient sample size for PAC identification.
(Also see Blumer et al. 1987a.)




Lemma 2.2: Let N be the number of rules in the hypothesis
space H. Let f be the target concept. If h is any hypothesis
that agrees with at least

m = (1/ε) ln(N/δ)

random samples, then

Prob{ d(h,f) > ε } < δ.

However, the above bound is very loose, and if the
hypothesis space is not finite, as with the set of
intervals on the real line, it cannot be applied at all.
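
The sample size in Lemma 2.2 follows from a standard counting argument, sketched here under the assumption of independent examples; the argument is not spelled out in the text. A fixed hypothesis with error greater than ε agrees with a single random example with probability less than 1 - ε, and there are at most N such hypotheses.

```latex
\Pr\bigl\{\exists\, h \in H :\ d(h,f) > \epsilon
      \ \text{and $h$ agrees with all $m$ examples}\bigr\}
  \;\le\; N (1-\epsilon)^{m}
  \;\le\; N e^{-\epsilon m}
  \;\le\; \delta
\quad\Longleftarrow\quad
m \;\ge\; \frac{1}{\epsilon} \ln \frac{N}{\delta}.
```

The dependence on N through the union bound is what makes the bound loose and what fails when H is infinite.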

So,  there  is  a need  to  improve  this  bound.  To  improve 
the  sample  complexity  we  may  use  several  different  measures  of 
a hypothesis  space  other  than  the  number  of  rules  in  the 
hypothesis  space  (Vapnik  1982;  Haussler  1988).  Two  other 
combinatorial  parameters  which  measure  the  characteristics  of 
a hypothesis  space  are  given  below. 

Definition 2.3: (Growth function, VC dimension)

1. Growth function (π_H(m)): The growth function, π_H(m), is
the maximum number of dichotomies (i.e., the maximum number of
ways of partitioning a set into a set of positive instances
and a set of negative instances) induced by hypotheses in H on
any set of m instances.

2. Vapnik-Chervonenkis dimension of H (VCdim(H)): Let I be
a set of instances in X. If H induces all possible 2^|I|
dichotomies of I, then we say that H shatters I. The
Vapnik-Chervonenkis dimension of H, denoted by VCdim(H), is the
cardinality of the largest finite subset I of X that is
shattered by H, or equivalently, the largest m such that the
growth function π_H(m) = 2^m. If arbitrarily large subsets of X
can be shattered, then VCdim(H) = ∞.

Below  we  give  some  examples  of  the  above  three  measures 
for  the  hypothesis  space,  and  give  an  illustration  of  Lemma 
2.2. 

Suppose we are to characterize successful firms using the
following list of attributes.


Attributes                     Values
-----------------------------  -------------------------
INDUSTRY TYPE                  ELECTRONICS
                               BANKING
                               AUTOMOBILE

SIZE                           LARGE
                               MEDIUM
                               SMALL

STRATEGIC PLANNING DEPARTMENT  YES
                               NO

MIS DEPARTMENT                 YES
                               NO

CURRENT RATIO                  GREATER THAN 3.0
                               BETWEEN 1.5 AND 3.0
                               LESS THAN 1.5

DEBT-EQUITY RATIO              GREATER THAN 0.7
                               LESS THAN OR EQUAL TO 0.7


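Suppose H is the space of conjunctive concepts. Since each
attribute either appears in a conjunction with one of its
values or does not appear at all, an attribute with k values
contributes k + 1 choices, and the number of rules in H is
|H| = (3+1)(3+1)(2+1)(2+1)(3+1)(2+1) = 3^3 · 4^3 = 1,728.
That is, N = 1,728.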

Hence, by Lemma 2.2, the learned concept h has error at most ε
with probability at least 1 - δ after

(1/ε)(ln(1/δ) + ln 1728)

= (1/ε)(ln(1/δ) + 7.455)

random independent examples, regardless of the underlying
distribution governing the generation of these examples. Note
that the number of examples required grows only logarithmically
with the size of the hypothesis space.
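
The arithmetic of this example can be checked with a few lines of Python. The function name sample_size_finite is hypothetical, and the values ε = 0.1 and δ = 0.05 are illustrative choices that do not come from the text; the result is rounded up because a sample size must be an integer.

```python
import math


def sample_size_finite(num_hypotheses, epsilon, delta):
    """Smallest integer m with m >= (1/epsilon) * ln(N/delta), as in Lemma 2.2."""
    return math.ceil(math.log(num_hypotheses / delta) / epsilon)


# Conjunctive concepts over the six firm attributes: each attribute is either
# fixed to one of its values or left out of the conjunction.
values_per_attribute = [3, 3, 2, 2, 3, 2]
N = math.prod(v + 1 for v in values_per_attribute)

# epsilon and delta below are illustrative choices, not values from the text.
m = sample_size_finite(N, epsilon=0.1, delta=0.05)
print(N, round(math.log(N), 3), m)
```

With these illustrative settings, about a hundred examples suffice for a hypothesis space of 1,728 rules.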

Now suppose we wish to learn the range of liabilities to
assets ratios within which successful companies operate. The
instance space X is the interval [0,1]. Let the hypothesis
space H be the intervals [x, y] with 0 ≤ x ≤ y ≤ 1 plus the
empty set. Since there are an infinite number of intervals in
[0, 1], |H| = ∞. The growth function for H is determined as
follows.

Consider the single example with value 0.6. The instance
$0.6 \in X$ can be labelled as + (a positive example) by the
concept [0.5, 1] and - (a negative example) by the concept
[0, 0.5]. Hence $\pi_H(1) = 2 = 2^1$.

Consider two examples with values 0.3 and 0.6. The
following four concepts give all different partitionings of
the two examples.

[0.1, 0.4] gives (+,-);

[0.4, 0.7] gives (-,+);

[0.2, 0.7] gives (+,+); and

[0.4, 0.5] gives (-,-).

Thus $\pi_H(2) = 4 = 2^2$.

Now consider three examples with values a, b, and c,
where a < b < c. Since no disjunction of intervals is
allowed, no concept in H can classify (a, b, c) as (+,-,+),
but all other classification combinations are possible.

Thus $\pi_H(3) = 7 < 2^3$.

Therefore, VCdim(H) = 2.
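
The dichotomy counts above can be reproduced by enumeration. The
sketch below assumes distinct sample points in [0, 1] and closed
intervals; it uses the observation that an interval whose endpoints
are sample points is enough to realize every labelling an arbitrary
interval can induce. The function names are ours.

```python
def interval_dichotomies(points):
    """Return every labelling of `points` induced by a closed interval
    [x, y] or by the empty concept (True means a positive label)."""
    labelings = {tuple(False for _ in points)}  # the empty concept
    ordered = sorted(points)
    for i, left in enumerate(ordered):
        for right in ordered[i:]:
            labelings.add(tuple(left <= p <= right for p in points))
    return labelings


def dichotomy_count(points):
    """Number of dichotomies that H induces on this particular sample."""
    return len(interval_dichotomies(points))


def is_shattered(points):
    """True when all 2^m labellings of the m points are induced."""
    return dichotomy_count(points) == 2 ** len(points)
```

For the samples used in the text, `dichotomy_count([0.6])` is 2,
`dichotomy_count([0.3, 0.6])` is 4, and any three distinct points
give 7, with the labelling (+,-,+) missing.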

We can improve the sample complexity of a learning
algorithm when VCdim(H) is considerably less than $\ln |H|$. The
following lemma gives one of the main results using the VC
dimension. (See Blumer et al. 1987b and Haussler 1988 for more
details.)

Lemma 2.4: (Blumer et al. 1987b) Let H be any nonempty
hypothesis space, and let d be the VC dimension of H, where d
is finite. Let f be the target concept. For any $0 < \epsilon < 1$,
if h is any hypothesis that agrees with at least

$m = \max\{ (4/\epsilon) \log(2/\delta), \; (8d/\epsilon) \log(13/\epsilon) \}$

random samples, then

$\mathrm{Prob}\{ d(h, f) > \epsilon \} < \delta$.



2.2  Decision  Tree  Induction  Methods 

A decision tree can be expressed as a disjunctive
concept. Each path in a decision tree corresponds to a
conjunction of variables (with values), and all paths having
the same class in their leaves can be combined with
disjunctions. For example,

{ ((Outlook=sunny) and (Humidity=normal)) or
  (Outlook=overcast) or ((Outlook=rain) and (Windy=false)) }

is a disjunctive concept, P, represented by the decision tree in
Figure 2.1 in Subsection 2.2.2.

2.2.1  Learning  Disjunctive  Concepts 

We  begin  with  the  definition  of  the  DNF  ( Disjunctive 
Normal  Form)  concept. 

Definition 2.5: A disjunctive normal form (DNF) expression is
any sum $m_1 + m_2 + \ldots + m_r$ of monomials where each monomial
$m_i$ is a product of literals. Here "sum" means an inclusive OR
operation and "product" means an AND operation. A literal is
either a variable (i.e., an attribute with value) or the
negation of a variable. An expression is monotone if no
variable is negated in it.

For the case of monotone DNF expressions having a bound,
k, on the length of each disjunct (we call this a k-DNF
concept), Valiant (1984, 1985) showed that a PAC concept can
be learned in polynomial time from negative examples with
sample complexity $O(n^k)$, where n is the number of variables.
Haussler (1988) improves this result by using a dual greedy
method which is a variant of the "star" methodology of
Michalski (1983). The improved bound is $O((\log kn)^2)$.

Rivest (1987) showed that k-DNF concepts without a
monotonicity restriction are polynomially learnable using
decision lists. A decision list is an extended
"if-then-elseif-else" rule, where the tests in the "if" parts are
conjunctions of literals drawn from 2n literals. (See Rivest
1987 for the precise definition.) Compared to decision trees,
decision lists have a simpler structure, but the complexity of
the decisions allowed at a node is greater. Let k-DL be the
set of all Boolean functions defined by decision lists, where
each function in the list is a term of size at most k.
k-DNF(n) is a proper subset of k-DL(n), where n is the number of
variables used in the expression. Rivest shows that k-DL is
polynomial-sized and polynomially identifiable. If a class of
formulae is polynomial-sized and polynomially identifiable,
then it is polynomially learnable. However, computation of
the "shortest" decision list consistent with a given sample is
an NP-hard problem (Rivest 1987).
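
To make the "if-then-elseif-else" reading concrete, the sketch
below evaluates a decision list and shows how a DNF expression can
be rewritten as one: each term becomes a rule that answers true, and
the final "else" answers false. The representation (a term as a
dictionary from variable name to required truth value) and the
variables x1, x2, x3 are hypothetical choices for illustration, not
Rivest's notation.

```python
from itertools import product


def satisfies(term, assignment):
    """A term is a conjunction: every literal in it must hold."""
    return all(assignment[name] == wanted for name, wanted in term.items())


def evaluate_decision_list(rules, default, assignment):
    """Return the value of the first rule whose term is satisfied."""
    for term, value in rules:
        if satisfies(term, assignment):
            return value
    return default


def evaluate_dnf(terms, assignment):
    """A DNF expression holds when at least one of its terms holds."""
    return any(satisfies(term, assignment) for term in terms)


# Hypothetical 2-DNF over x1, x2, x3: (x1 and not x2) or (x2 and x3).
DNF_TERMS = [{"x1": True, "x2": False}, {"x2": True, "x3": True}]
# Equivalent decision list: each term answers True, the default is False.
DECISION_LIST = [(term, True) for term in DNF_TERMS]

VARIABLES = ("x1", "x2", "x3")
AGREE = all(
    evaluate_decision_list(DECISION_LIST, False, dict(zip(VARIABLES, bits)))
    == evaluate_dnf(DNF_TERMS, dict(zip(VARIABLES, bits)))
    for bits in product([False, True], repeat=len(VARIABLES))
)
```

Here `AGREE` confirms that the list and the DNF expression coincide
on all eight assignments. The converse need not hold, since a
decision list may also contain rules that answer false.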




2.2.2  Learning  a Decision  Tree 

A decision tree has considerable expressive power: it is
concise, it places no limitation on the attributes and
classifications allowed, and it is a more complex structure
than DNF concepts or decision lists. Because of this
complexity, little theoretical research has been done on
decision trees, and most research on learning a decision tree
is based on heuristic reasoning. In this subsection we give a
typical example of learning a decision tree and review previous
work on decision tree induction.

Learning a decision tree requires a training set and a
learning algorithm. The training set in Table 2.1 was given
by Quinlan (1986). Figure 2.1 shows the decision tree
produced from Table 2.1 by Quinlan's (1986) ID3 algorithm.

Here we have four attributes, with their values in
parentheses.

1.  Outlook  (sunny,  overcast,  rain) 

2.  Temperature  (hot,  mild,  cool) 

3.  Humidity  (high,  normal) 

4.  Windy  (true,  false) 

Each line in Table 2.1 corresponds to an example.

Table 2.1: Example training set

             ATTRIBUTES
Outlook    Temperature   Humidity   Windy    Member of class

sunny      hot           high       false    N
sunny      hot           high       true     N
overcast   hot           high       false    P
rain       mild          high       false    P
rain       cool          normal     false    P
rain       cool          normal     true     N
overcast   cool          normal     true     P
sunny      mild          high       false    N
sunny      cool          normal     false    P
rain       mild          normal     false    P
sunny      mild          normal     true     P
overcast   mild          high       true     P
overcast   hot           normal     false    P
rain       mild          high       true     N


[Outlook]
  sunny:    [Humidity]
              high:   N
              normal: P
  overcast: P
  rain:     [Windy]
              true:   N
              false:  P

*: [ ] denotes an attribute.

Figure 2.1: A decision tree learned from the training set
in Table 2.1.


The decision tree in Figure 2.1 can be interpreted as a
set of five production rules.

Rule 1: If (Outlook = sunny) and (Humidity = high)
        Then N

Rule 2: If (Outlook = sunny) and (Humidity = normal)
        Then P

Rule 3: If (Outlook = overcast)
        Then P

Rule 4: If (Outlook = rain) and (Windy = true)
        Then N

Rule 5: If (Outlook = rain) and (Windy = false)
        Then P
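
The five rules can be checked against the training set directly.
The sketch below transcribes the fourteen examples of Table 2.1
(Outlook, Temperature, Humidity, Windy, class) and applies Rules 1
to 5. It also checks that the disjunctive concept P given at the
start of Section 2.2 selects exactly the examples of class P. The
function names are ours.

```python
TABLE_2_1 = """
sunny hot high false N
sunny hot high true N
overcast hot high false P
rain mild high false P
rain cool normal false P
rain cool normal true N
overcast cool normal true P
sunny mild high false N
sunny cool normal false P
rain mild normal false P
sunny mild normal true P
overcast mild high true P
overcast hot normal false P
rain mild high true N
"""
EXAMPLES = [line.split() for line in TABLE_2_1.strip().splitlines()]


def classify(outlook, temperature, humidity, windy):
    """Apply Rules 1 to 5; Temperature is not tested by the tree."""
    if outlook == "sunny":
        return "N" if humidity == "high" else "P"   # Rules 1 and 2
    if outlook == "overcast":
        return "P"                                  # Rule 3
    return "N" if windy == "true" else "P"          # Rules 4 and 5


def concept_p(outlook, temperature, humidity, windy):
    """The disjunctive concept P: the union of the paths ending in P."""
    return ((outlook == "sunny" and humidity == "normal")
            or outlook == "overcast"
            or (outlook == "rain" and windy == "false"))


MISCLASSIFIED = [e for e in EXAMPLES if classify(*e[:4]) != e[4]]
```

`MISCLASSIFIED` is empty, so the tree classifies every training
example correctly, as the first stage of tree construction requires.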

There are a number of algorithms for learning decision
trees (Quinlan 1979, 1983, 1986, 1987a; Niblett 1987; Utgoff
1989; Mingers 1989a, 1989b). Most algorithms involve three
main stages:

1) Construct a complete tree able to exactly classify all
the examples.

2) Prune this tree to give statistical reliability.

3) Process the pruned tree to improve understandability.

Some algorithms adopt pruning techniques while they construct
a decision tree. Several construction-time pruning methods
will be given in Subsection 2.2.2.3.

2.2.2.1 Decision tree construction

Decision tree construction can be performed incrementally
or nonincrementally.

1) Nonincremental induction: A nonincremental algorithm
infers a concept once, based on the entire set of available
training instances. Quinlan (1979, 1986) developed a
nonincremental algorithm, ID3, for inducing a decision tree.

The  overall  approach  to  constructing  a decision  tree  is 
to  choose  an  attribute  that  best  divides  the  examples  into 
their  classes,  and  then  partition  the  data  according  to  the 
values  of  that  attribute.  This  process  is  recursively  applied 
to  each  partitioned  subset  with  the  procedure  terminating  when 
all  examples  in  the  current  subset  have  the  same  class.  (This 
is  a termination  criterion.  We  may  change  this  strict 
termination  criterion  depending  on  the  situation.) 
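
The recursive procedure just described can be sketched as follows.
This is an illustration rather than Quinlan's implementation: the
attribute is chosen by information gain measured in bits (an
assumption at this point in the text), the strict termination
criterion is used, and a majority-class leaf is returned if no
attribute is left to test. The data are the fourteen examples of
Table 2.1, and the function names are ours.

```python
import math
from collections import Counter

NAMES = ["Outlook", "Temperature", "Humidity", "Windy", "Class"]
TABLE_2_1 = """
sunny hot high false N
sunny hot high true N
overcast hot high false P
rain mild high false P
rain cool normal false P
rain cool normal true N
overcast cool normal true P
sunny mild high false N
sunny cool normal false P
rain mild normal false P
sunny mild normal true P
overcast mild high true P
overcast hot normal false P
rain mild high true N
"""
EXAMPLES = [dict(zip(NAMES, line.split()))
            for line in TABLE_2_1.strip().splitlines()]


def entropy(examples):
    """Information content (in bits) of the class labels."""
    counts = Counter(e["Class"] for e in examples)
    n = len(examples)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def gain(examples, attribute):
    """Reduction in entropy obtained by partitioning on `attribute`."""
    remainder = 0.0
    for value in {e[attribute] for e in examples}:
        subset = [e for e in examples if e[attribute] == value]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy(examples) - remainder


def build_tree(examples, attributes):
    """Return a class label (leaf) or (attribute, {value: subtree})."""
    classes = {e["Class"] for e in examples}
    if len(classes) == 1:                 # strict termination criterion
        return classes.pop()
    if not attributes:                    # nothing left to test
        return Counter(e["Class"] for e in examples).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(examples, a))
    rest = [a for a in attributes if a != best]
    return (best, {value: build_tree([e for e in examples if e[best] == value], rest)
                   for value in sorted({e[best] for e in examples})})


TREE = build_tree(EXAMPLES, NAMES[:4])
```

On this training set the sketch selects Outlook at the root, then
Humidity under sunny and Windy under rain, which is the tree of
Figure 2.1.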

ID3 can build a decision tree by using the whole training
set. However, since this approach is often computationally
inefficient, ID3 often uses an iterative procedure. In this
iterative framework, a subset of the training set, called the
window, is chosen at random and a decision tree is formed from
it by using the overall approach described above.
This tree correctly classifies all examples in the window.
All other examples in the training set are then classified
using the tree. If the tree correctly classifies all other
examples, then it is correct for the entire training set and
the process terminates. If it does not correctly classify all
other examples, a selection of the incorrectly classified
examples is added to the window and the process is repeated.
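
The iterative framework can be sketched independently of the tree
builder. In the sketch below the learner is passed in as two
functions, `build` and `classify`, and is assumed to classify its
own window correctly. To keep the block self-contained, a stand-in
learner that memorizes the window is used in place of ID3; the
stand-in, the toy data, and all names are our assumptions.

```python
import random
from collections import Counter


def windowing(examples, build, classify, window_size, batch_size, seed=0):
    """Grow a window until the model built from it fits all examples.

    `examples` is a list of (instance, label) pairs.
    """
    rng = random.Random(seed)
    window = set(rng.sample(range(len(examples)), window_size))
    while True:
        model = build([examples[i] for i in sorted(window)])
        # Examples outside the window that the current model gets wrong.
        wrong = [i for i in range(len(examples))
                 if i not in window
                 and classify(model, examples[i][0]) != examples[i][1]]
        if not wrong:
            return model, len(window)
        # Add a selection of the misclassified examples and repeat.
        window.update(rng.sample(wrong, min(batch_size, len(wrong))))


def build_lookup(window_examples):
    """Stand-in learner (not ID3): memorize the window, and answer with
    the window's majority class for instances it has not seen."""
    table = dict(window_examples)
    default = Counter(label for _, label in window_examples).most_common(1)[0][0]
    return table, default


def classify_lookup(model, instance):
    table, default = model
    return table.get(instance, default)


# Toy training set: the label is True for multiples of three.
TOY_EXAMPLES = [((i,), i % 3 == 0) for i in range(30)]
MODEL, FINAL_WINDOW_SIZE = windowing(TOY_EXAMPLES, build_lookup,
                                     classify_lookup, 5, 3)
```

The loop always terminates, because each pass adds at least one
example to the window and a window containing the whole training set
leaves nothing to misclassify.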

Quinlan's (1986) results showed that, in this way, correct
decision trees have been found after only a few iterations for
training sets of up to 30 thousand examples described in terms
of up to 50 attributes. Empirical evidence suggests that a
correct decision tree is usually found more quickly by this
iterative method than by forming a tree directly from the
entire training set (Quinlan 1986; Wirth 1988). Note that the
iterative framework cannot guarantee that "better" trees have
not been overlooked. By better we mean that the tree is "more
general". That is, a better tree is a more comprehensive and
concise description of decision rules for the situation.

The choice of an attribute for the root of the tree is
crucial if the decision tree is to be simple. Since
Quinlan's original work, there have been a number of
alternative suggestions for measures to be used in selecting
attributes. Some of them are as follows.

a) Quinlan's information measure: Quinlan (1979, 1983)
proposed an evaluation function based on a classic formula
from information theory that measures the theoretical
information content of a code. The measure is

$-\sum_i p_i \log(p_i)$

where $p_i$ is the probability of the i-th message. The value of
this measure depends on the likelihood of the various possible
messages. If they are equally likely (and so the $p_i$ are
equal), there is the greatest amount of uncertainty and the
information gained will be greatest. The less equal the
probabilities, the less information there is to be gained.
The value of the function also depends on the number of
possible messages.
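
The two properties just mentioned are easy to see numerically. The
following sketch evaluates $-\sum_i p_i \log(p_i)$; the base of the
logarithm is not fixed by the formula, and base 2 (bits) is assumed
here. The function name is ours.

```python
import math


def information_measure(probabilities, base=2.0):
    """Evaluate -sum(p_i * log(p_i)); terms with p_i = 0 contribute 0.

    The base-2 default is an assumption, not taken from the text.
    """
    return -sum(p * math.log(p, base) for p in probabilities if p > 0)
```

Two equally likely messages give 1 bit, while probabilities (0.9,
0.1) give about 0.47 bits. Four equally likely messages give 2 bits,
which shows the dependence on the number of possible messages.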




b)  The  chi-square  contingency  table  statistic  (Mingers 
1986,  1987):  This  is  the  traditional  statistic  for  measuring 
the  association  between  two  variables  in  a contingency  table. 
It  compares  the  observed  frequencies  with  the  frequencies  that 
one  would  expect  if  there  were  no  association  between  the 
variables. 

c) The G statistic (Mingers 1989a): The G statistic is
defined as follows.

G = 2N * IM,

where N is the number of examples and IM is Quinlan's
information measure.

d) Probability measures (Mingers 1989a): Instead of using
the value of the chi-square or G statistic itself, this method
computes the probability of such a value on the chi-square
distribution under the assumption that the attribute and the
classes have no relationship.
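
Measures b) to d) can be illustrated on the contingency table of
Outlook against class taken from Table 2.1. The sketch assumes that
G is computed as $2 \sum O \ln(O/E)$ over observed counts O and
expected counts E, and that IM in the formula above is the
information gain measured with natural logarithms; under these
assumptions the two expressions for G coincide. The probability
function uses the closed form $e^{-x/2}$, which is valid only for a
chi-square distribution with two degrees of freedom (the case of a
3 x 2 table). All names are ours.

```python
import math

# Rows: Outlook = sunny, overcast, rain. Columns: class P, class N.
OUTLOOK_BY_CLASS = [[2, 3], [4, 0], [3, 2]]


def expected_counts(table):
    """Expected frequencies if attribute and class were unrelated."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    return [[r * c / n for c in cols] for r in rows]


def chi_square(table):
    return sum((o - e) ** 2 / e
               for obs, exp in zip(table, expected_counts(table))
               for o, e in zip(obs, exp))


def g_statistic(table):
    return 2.0 * sum(o * math.log(o / e)
                     for obs, exp in zip(table, expected_counts(table))
                     for o, e in zip(obs, exp) if o > 0)


def information_gain_nats(table):
    """Entropy of the classes minus the weighted entropy of each row."""
    def h(counts):
        total = sum(counts)
        return -sum(c / total * math.log(c / total) for c in counts if c > 0)
    n = sum(map(sum, table))
    class_totals = [sum(c) for c in zip(*table)]
    return h(class_totals) - sum(sum(r) / n * h(r) for r in table)


def p_value_two_df(statistic):
    """Upper tail of the chi-square distribution with 2 degrees of freedom."""
    return math.exp(-statistic / 2.0)
```

For this table the chi-square statistic is about 3.55, and
`g_statistic` agrees with 2N times `information_gain_nats` for
N = 14.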

e) The GINI index of diversity (Breiman et al. 1984): Let
$p_i$ be the probability of class i. The GINI function measures
the "impurity" of an attribute with respect to the classes as
follows:

The general GINI function = $\sum_{i \ne j} p_i p_j = 1 - \sum_i p_i^2$.

The larger the GINI value, the more impure is the attribute.
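
A short numerical check of the impurity interpretation follows. It
assumes the form of the index used by Breiman et al. (1984), namely
the sum of $p_i p_j$ over all pairs of distinct classes, which equals
one minus the sum of the squared class probabilities. The function
names are ours.

```python
def gini(probabilities):
    """One minus the sum of squared class probabilities."""
    return 1.0 - sum(p * p for p in probabilities)


def gini_pairwise(probabilities):
    """The same index written as a sum over pairs of distinct classes."""
    return sum(p * q
               for i, p in enumerate(probabilities)
               for j, q in enumerate(probabilities) if i != j)
```

A pure node, with probabilities (1, 0), has index 0. Two equally
likely classes give 0.5, the largest value possible with two
classes.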

f) Gain-ratio measure (Quinlan 1986): Quinlan's (1986)
gain-ratio measure of attribute a, GR(a), is

GR(a) = IM(a)/IV(a),

where IV(a) is the information value of attribute a. As we
see in the formula, this measure incorporates the idea that an
attribute itself has some information value.
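
As an illustration, the gain ratio of Outlook can be computed from
the class counts in Table 2.1. The sketch assumes that IM(a) is the
information gain of the attribute and that IV(a) is the information
content of the partition sizes produced by the attribute, both in
bits, following Quinlan (1986). The function names are ours.

```python
import math


def entropy(counts):
    """Information content (in bits) of a list of counts."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)


def gain_ratio(partition):
    """Return (IM, IV, GR) for a list of per-value class-count lists."""
    sizes = [sum(counts) for counts in partition]
    n = sum(sizes)
    class_totals = [sum(column) for column in zip(*partition)]
    im = entropy(class_totals) - sum(s / n * entropy(c)
                                     for s, c in zip(sizes, partition))
    iv = entropy(sizes)
    return im, iv, im / iv


# Outlook: sunny (2 P, 3 N), overcast (4 P, 0 N), rain (3 P, 2 N).
IM, IV, GR = gain_ratio([[2, 3], [4, 0], [3, 2]])
```

For Outlook this gives IM of about 0.247 bits, IV of about 1.577
bits, and a gain ratio of about 0.156.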

g) Marshall correction (Marshall 1986): This correction
factor multiplies any calculated measure by the product of the
row totals, which give the sum of probabilities of each
partition. The reason for this is that it is preferable to
obtain a partition which is balanced. Let G be any measure.
Then Marshall's corrected measure is G* = G * A * B, where A and
B are the row totals.

There  are  also  various  methods  other  than  ID3  for 
decision  tree  construction. 

h)  Hyperplane  cut:  Koehler  and  Majthay  (1988)  merged  AI 
induction  and  the  classical  discriminant  analysis  method. 
They  generate  a hyperplane  to  partition  the  training  set 
recursively.  So,  by  choosing  a combination  of  more  than  one 
attribute  as  a node  at  each  step,  they  show  that  the 
constructed decision tree has better classification power than
ID3.

i) Attribute-value pair: Spangler et al. (1989) devised
a metric  which  uses  attribute-value  pairs  to  select  the  best 
partition.  They  generate  strictly  binary  trees  by  choosing 
the  best  attribute-value  pair,  rather  than  the  best  attribute, 
at  any  choice  point  (also  see  Cheng  et  al.  1988). 

j) Use of background knowledge: Nunez (1991) presented a
decision tree induction algorithm which executes several types
of generalization and at the same time reduces the
classification cost by means of background knowledge. The
background knowledge contains an ISA hierarchy and the
measurement cost associated with each attribute. Buntine
(1989) gives an experimental result on the effectiveness of
Bayesian classifiers.

2)  Incremental  induction:  An  incremental  induction 
algorithm  revises  the  current  concept,  whenever  necessary,  in 
response  to  each  newly  observed  training  instance.  This  is 
appropriate  for  learning  tasks  in  which  there  is  a stream  of 
training  instances.  Utgoff's  (1989)  ID5  is  an  example  of  this 
type  of  algorithm  for  decision  tree  induction.  Van  de  Velde 
(1990)  proposes  an  incremental  algorithm  for  the  induction  of 
decision  trees  which  are  topologically  minimal  in  the  sense 
that  an  attribute's  activity  is  localized  as  much  as  possible 
in  the  tree. 

There are a number of other directions of research on the
effectiveness of various decision tree construction methods.
Cheng et al. (1988) identified two causes of
over-specialization in ID3: the irrelevant value problem and the
missing value problem. They developed a new algorithm that
avoids these problems. Quinlan and Rivest (1989) used the
minimum description length principle to find a decision tree
which minimizes the total information required to specify the
class of all training examples. This approach can also be
considered as a method of pruning. Chan (1989) described a
binary classification tree (BCT), a system that learns
from examples and represents learned concepts as a binary
polythetic decision tree. Polythetic trees differ from
monothetic decision trees in that a logical combination of
multiple (vs. a single) attribute values may label each tree
branch. The empirical results showed that BCT is more
accurate than ID3. Chan and Wong (1990) presented a
probabilistic inductive learning system and showed better
performance than ID3 in terms of computational efficiency and
classification accuracy. Utgoff and Brodley (1990) presented
PT2, an algorithm which is incremental and searches for a
multivariate split at each node. Gelfand et al. (1991)
propose an iterative growing and pruning algorithm for
classification trees. This method divides the data sample
into two subsets and iteratively grows a tree with one subset
and prunes it with the other subset, successively
interchanging the roles of the two subsets. Chou (1991)
presents an iterative algorithm that finds a locally optimal
partition, which minimizes an expected loss, for an arbitrary
loss function.


2.2.2.2 PAC-learning a decision tree

Here we determine a sample complexity for PAC-learning a
decision tree. Consider a hypothesis space, H, of n
attributes. Let $c_i$ be the number of permissible values for
attribute i, where $1 \le i \le n$. Then the total number of
distinct instances in the domain X is

$d = \prod_{i=1}^{n} c_i$.

Since each distinct instance can have a positive or negative
label, the total number of concepts in the hypothesis space is

$|H| = 2^d$.

Note that any sample of distinct instances can be shattered by
H. Since the maximum number of possible distinct instances is
d, the Vapnik-Chervonenkis dimension of the hypothesis space
is

VCdim(H) = d.

To determine the sample size for $(\epsilon, \delta)$-learning, we may
use the following theorem from Tsai and Koehler (1991).

Theorem 2.6: (Sample size for ID3: Tsai and Koehler 1991)

1. For any given $\epsilon$ and $\delta$, with $0 < \epsilon, \delta < 1$, if the
sample size is at least

$[\ln(1/\delta) + d \ln 2] / \epsilon$,

then ID3 will $\epsilon$-exhaust H with probability at least $1 - \delta$.

2. For $0 < \epsilon < 1/2$, ID3 must use a sample size of at least

$\max\{ [(1-\epsilon)/\epsilon] \ln(1/\delta), \; d[1 - 2(\epsilon(1-\delta) + \delta)] \}$.




For example, suppose the number of attributes is five,
and each attribute has five possible values, so that
$d = 5^5 = 3{,}125$. Further suppose we require $\epsilon = .1$ and
$\delta = .01$. Then by Theorem 2.6, at least

$\lceil \max\{ 41.45, 2443.75 \} \rceil = 2444$

samples would be necessary.
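
The figures in this example can be reproduced with a few lines of
code. The sketch below evaluates both parts of Theorem 2.6 as
stated above; the function names are ours.

```python
import math


def instance_space_size(value_counts):
    """d: the product of the numbers of permissible attribute values."""
    return math.prod(value_counts)


def sufficient_sample_size(d, epsilon, delta):
    """Part 1 of Theorem 2.6: [ln(1/delta) + d ln 2] / epsilon."""
    return (math.log(1.0 / delta) + d * math.log(2.0)) / epsilon


def necessary_sample_terms(d, epsilon, delta):
    """Part 2 of Theorem 2.6: the two terms inside the max."""
    first = (1.0 - epsilon) / epsilon * math.log(1.0 / delta)
    second = d * (1.0 - 2.0 * (epsilon * (1.0 - delta) + delta))
    return first, second


D = instance_space_size([5] * 5)
FIRST, SECOND = necessary_sample_terms(D, 0.1, 0.01)
NECESSARY = math.ceil(max(FIRST, SECOND))
```

With d = 3,125, the two terms are about 41.45 and 2443.75, so
`NECESSARY` is 2444. Part 1 of the theorem gives a sufficient sample
size of about 21,707 for the same $\epsilon$ and $\delta$.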

If there are any real-valued attributes, then VCdim(H)
will be infinite since the number of permissible values for
such an attribute is infinite. Many decision tree induction
algorithms, however, are modified to handle this
situation. For example, they divide the range of a
real-valued attribute in half, or divide the range into a finite
number of meaningful intervals based on the set of values of
the attribute obtained in the training sample. With this
modification, Theorem 2.6 can be applied with finite VCdim(H).
Theorem 2.6 can be used for most decision tree induction
algorithms which select one attribute at a time.

2.2.2.3 Pruning a decision tree

When a decision tree algorithm is used with "uncertain"
data rather than deterministic data, the pruning stage is
important to remove branches with little statistical validity.
By uncertain we mean "having error in representing the true
concept". Uncertainty in the data may be due to noise in the
measurements or to the presence of factors (i.e., attributes)
which are hidden or cannot be measured. In the following we
will use the term "noisy data" interchangeably with "uncertain
data" since many researchers use "noisy data" in this
situation. Recently a number of papers investigated pruning
large decision trees (Niblett 1987; Quinlan 1987a; Fisher and
Schlimmer 1988; Mingers 1989b). Many of the branches of the
constructed large decision trees will reflect chance
occurrences in the particular data rather than representing
true underlying relationships. Often, these occurrences are
very unlikely to recur in further examples. Since these less
reliable branches can be removed by pruning, the pruned
decision tree often gives better classification over the whole
instance space even though it may have a higher error over the
training set. Several empirical methods have been proposed for
pruning decision trees. We distinguish construction-time
pruning from pruning after building a fully-grown decision tree
(i.e., post-pruning).

Construction-time pruning methods are mainly used to
decide when to stop expanding a decision tree. These methods
replace the old termination criterion, which stops expanding a
tree when all examples in the current subset are in the same
class, by a new termination criterion, which is related to the
selection measure used in constructing a decision tree. Two
common approaches in construction-time pruning are the
threshold method and the chi-square test method.

1) Threshold method: Let SM(a) be the value of the
current selection measure at node a. The threshold method
chooses a threshold t by some criterion. If SM(a) is less than
t, then it stops expanding the decision tree at that node. (See
Spangler et al. (1989) and Breiman et al. (1984) for the
details.)

2) Chi-square test method: In this method, we assume that
at a certain node, the subtree created at that node will have
the same class distribution as the current tree. With that
assumption, we can calculate the chi-square statistic for each
attribute. We stop expanding a tree if the chi-square
statistic is not significant at a predetermined significance
level. (Quinlan (1983) chose .01 as the significance level.)

There are also some other variations such as the cross
validation method used in the Assistant program (Niblett
1987).

As indicated by Niblett (1987), construction-time pruning
methods have a weakness in that the decision to stop
expanding a tree is made on local information alone.
That is, we cannot ignore the possibility that descendant nodes
of the node "a" may have better discriminating power. Breiman
et al. (1984) have also reached the conclusion that
looking for the right stopping rule is the wrong way of
looking at the problem, and pruning upward in the right way is
more satisfactory. The following methods of pruning use
global information, and are used to prune a fully-grown
decision tree.


3) Error-complexity pruning: Breiman et al. (1984) have
developed a two-stage method for pruning. The first stage
generates a series of trees pruned by different amounts. The
second stage selects one of these by examining the number of
classification errors each of them makes on an independent
data set (a test or holdout data set). In pruning, the
error-complexity method must take account of both the number of
errors and the complexity (size) of the tree.

4) Critical value pruning: Mingers' (1987) method relies
on estimating the importance or strength of a node from
calculations done in the tree construction stage. It
specifies a critical value and prunes those nodes which do not
reach the critical value, unless a node further along the
branch does reach it. The larger the critical value selected,
the greater the degree of pruning and the smaller the
resulting tree.

5) Minimum-error pruning: Niblett and Bratko (1986) have
developed a method to find a single tree which should,
theoretically, give the minimum error rate when classifying
independent sets of data. It is based on the assumption of
equally likely classes. If there are k classes, and n is the
number of examples, and $n_c$ examples are of class c, then the
expected error, E, is

$E = (n - n_c + k - 1) / (n + k)$.

For this method, the number of classes strongly affects the
degree of pruning, leading to unstable results (Mingers 1989).




6) Reduced error pruning: Quinlan (1987a) suggested a
method which produces a series of pruned trees by using test
data directly, rather than using it only for the selection of
the best tree. This approach generates a set of trees, ending
with the smallest minimum-error tree on the test data.

7) Pessimistic error pruning: Quinlan (1987a) suggested
using the continuity correction for the binomial distribution
to obtain a more realistic estimate of the misclassification
rate.

In  the  following  we  give  an  example  of  pruning  a decision 
tree.  Messier  and  Hansen  (1988)  developed  a decision  tree 
using  ID3  to  provide  rules  to  predict  loan  default  by 
companies.  They  started  with  18  financial  ratios  and  trends 
for  each  of  their  training  instances. 

In  Figure  2.2  we  have  altered  their  final  tree  to  express 
all  the  variables  as  binary  variables.  The  exact  meaning  of 
high  and  low  for  each  attribute  is  given  below: 


Attributes                      low        high

Current Ratio                   < 1.912    > 1.912
Long-term Debt/Net Worth        < .486     > .486
Lo Long-term Debt/Net Worth     < .046     > .046 and < .486
Working Capital/Sales           < .222     > .222
Net Income/Total Assets         < .100     > .100
Net Income/Sales                < .010     > .010


Current Ratio
  low:  Default (10)*
  high: Long-term Debt/Net Worth
    low:  Lo Long-term Debt/Net Worth
      low:  Net Income/Total Assets
        low:  Default (1)
        high: No Default (1)
      high: No Default (11)
    high: Working Capital/Sales
      low:  Net Income/Sales
        low:  Default (1)
        high: No Default (4)
      high: Default (4)

*: The number of examples in each terminal node.

Figure 2.2: Decision tree predicting loan default


To give an example of pruning, we use the Error-complexity
method of Breiman et al. (1984) and Minimum-error
pruning of Niblett and Bratko (1986).

Define the error-complexity measure, $\alpha$, for a
nonterminal (nonleaf) node t as follows:

$\alpha = (e_p - e_u) / (n_t - 1)$,

where $e_p$ = the error rate when we prune node t,
      $e_u$ = the error rate of the unpruned tree, and
      $n_t$ = the number of leaf nodes which are descendants of t.

The Error-complexity method calculates $\alpha$ for each nonterminal
node and selects a node with the smallest value of $\alpha$ for
pruning. (That is, it finds the weakest link in the tree.) This
process is repeated and a tree is selected as the final tree
based on the misclassification rate over an independent test
set. In Figure 2.3 we show the $\alpha$ values for each node.
Figure 2.4 shows the tree after pruning the node labelled Lo
Long-term Debt/Net Worth, which had the smallest value of $\alpha$.
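
The error-complexity values reported in the figures of this example
can be recomputed from the leaf counts of Figure 2.2. The sketch
below is an illustration under our own representation: a leaf is a
pair (Default count, No Default count), an internal node is a triple
(name, low subtree, high subtree), and a pruned node predicts its
majority class. Exact fractions are used so that the values can be
compared with the figures.

```python
from fractions import Fraction

TREE = (
    "Current Ratio",
    (10, 0),
    ("Long-term Debt/Net Worth",
     ("Lo Long-term Debt/Net Worth",
      ("Net Income/Total Assets", (1, 0), (0, 1)),
      (0, 11)),
     ("Working Capital/Sales",
      ("Net Income/Sales", (1, 0), (0, 4)),
      (4, 0))),
)


def is_leaf(node):
    return len(node) == 2


def class_counts(node):
    """(Default, No Default) totals of the examples reaching this node."""
    if is_leaf(node):
        return node
    low, high = class_counts(node[1]), class_counts(node[2])
    return (low[0] + high[0], low[1] + high[1])


def leaf_count(node):
    return 1 if is_leaf(node) else leaf_count(node[1]) + leaf_count(node[2])


def subtree_errors(node):
    """Training errors when every leaf predicts its majority class."""
    if is_leaf(node):
        return min(node)
    return subtree_errors(node[1]) + subtree_errors(node[2])


def alphas(tree):
    """alpha = (e_p - e_u) / (n_t - 1) for every nonterminal node."""
    total = sum(class_counts(tree))
    result = {}

    def visit(node):
        if is_leaf(node):
            return
        extra = min(class_counts(node)) - subtree_errors(node)
        result[node[0]] = Fraction(extra, total) / (leaf_count(node) - 1)
        visit(node[1])
        visit(node[2])

    visit(tree)
    return result


def prune(node, name):
    """Replace the node called `name` by a leaf holding its class counts."""
    if is_leaf(node):
        return node
    if node[0] == name:
        return class_counts(node)
    return (node[0], prune(node[1], name), prune(node[2], name))


def pruning_sequence(tree):
    """Repeatedly prune the weakest link until only a leaf remains."""
    order = []
    while not is_leaf(tree):
        values = alphas(tree)
        weakest = min(values, key=values.get)
        order.append((weakest, values[weakest]))
        tree = prune(tree, weakest)
    return order
```

`alphas(TREE)` gives 1/64 for Lo Long-term Debt/Net Worth, the
smallest value, and `pruning_sequence(TREE)` then prunes Net
Income/Sales (1/32) and Long-term Debt/Net Worth (1/16), in the same
order as in the text.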


Current Ratio (1/12)**
  low:  Default (10)*
  high: Long-term Debt/Net Worth (3/80)
    low:  Lo Long-term Debt/Net Worth (1/64)
      low:  Net Income/Total Assets (1/32)
        low:  Default (1)
        high: No Default (1)
      high: No Default (11)
    high: Working Capital/Sales (1/16)
      low:  Net Income/Sales (1/32)
        low:  Default (1)
        high: No Default (4)
      high: Default (4)

*: The number of examples in each terminal node.
**: $\alpha$ values for each node.

Figure 2.3: Decision Tree with $\alpha$ values for each node




Current Ratio (15/128)**
  low:  Default (10)*
  high: Long-term Debt/Net Worth (5/96)
    low:  No Default (12:1)***
    high: Working Capital/Sales (1/16)
      low:  Net Income/Sales (1/32)
        low:  Default (1)
        high: No Default (4)
      high: Default (4)

*: The number of examples in each terminal node.
**: $\alpha$ values for each node.
***: 12 examples are in No Default, one example is in Default.

Figure 2.4: Decision Tree with $\alpha$ values for each node


In Figure 2.4, the node Net Income/Sales has the
smallest $\alpha$ value, and can be pruned. Figure 2.5 shows the
tree after pruning the node labelled Net Income/Sales.



Current Ratio (14/96)**
  low:  Default (10)*
  high: Long-term Debt/Net Worth (1/16)
    low:  No Default (12:1)***
    high: Working Capital/Sales (3/32)
      low:  No Default (4:1)
      high: Default (4)

*: The number of examples in each terminal node.
**: $\alpha$ values for each node.
***: 12 examples are in No Default, one example is in Default.

Figure 2.5: A Pruned Decision Tree predicting Loan Default


Figure 2.6 shows the tree after pruning the node labelled
Long-term Debt/Net Worth, which had the smallest value of $\alpha$.


Current Ratio
  low:  Default (10)*
  high: No Default (16:6)**

*: 10 examples are in Default.
**: 16 examples are in No Default, 6 examples are in Default.

Figure 2.6: A Pruned Decision Tree predicting Loan Default


Each of the pruned trees is used to classify an independent
test set, and the smallest tree with a misclassification rate
within one standard error of the minimum is selected as the
final tree (Breiman et al. 1984; Mingers 1989b).




Using the same hold-out sample of Messier and Hansen
(1988), we have the following misclassification rates:

# of nodes in the pruned tree        misclassification rates

6 (Unpruned tree: Figure 2.2)        2/16
4 (Figure 2.4)                       2/16
3 (Figure 2.5)                       2/16
1 (Figure 2.6)                       2/16
0                                    0 or 1 *

*: The misclassification rate depends on the labelling rule
since there are equal numbers of Default and No Default
examples.

Here we choose the tree with only a root node (Figure
2.6) as the best pruned tree. The above hold-out sample may
not be considered a random sample since its examples are all
in one class, the Default class. In general, misclassification
rates show a near-convex pattern with a random hold-out
sample. (That is, the misclassification rate decreases as we
prune a small amount, but increases as we prune too much.)

Now  consider  the  Minimum-error  method  of  pruning.  The 
expected  error  rate  of  the  unpruned  node  is  calculated  by 
weighting  the  error  rates  of  each  branch  according  to  the 
proportion  of  examples  in  each  branch.  For  example,  if  we 
prune  the  node  Net  Income/Sales  in  Figure  2.2,  then  with  the 
number  of  classes  k = 2, 


E = (5-4+1)/(5+2) = 2/7.

If we do not prune this node, the expected error rate is

E = (1/5)(1-1+1)/(1+2) + (4/5)(4-4+1)/(4+2) = 1/5.

Since pruning increases the expected error rate, we do not
prune the node Net Income/Sales. By similar calculations
for all nonterminal nodes we see that the unpruned tree in
Figure 2.2 has the smallest expected error rate.
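
The arithmetic above can be checked with a short sketch. It assumes that the expected error of a leaf with n examples, n_c of them in the majority class, and k classes is (n - n_c + k - 1)/(n + k); this form is inferred from the worked numbers in the text, and the function names are illustrative only.

```python
from fractions import Fraction

def expected_error(n, n_c, k=2):
    """Assumed leaf error (n - n_c + k - 1) / (n + k) for n examples,
    n_c of them in the majority class, and k classes."""
    return Fraction(n - n_c + k - 1, n + k)

def unpruned_error(branches, k=2):
    """Weight each branch's expected error by its share of the examples.
    branches is a list of (n, n_c) pairs, one per child leaf."""
    total = sum(n for n, _ in branches)
    return sum(Fraction(n, total) * expected_error(n, n_c, k)
               for n, n_c in branches)

pruned = expected_error(5, 4)             # Net Income/Sales collapsed to one leaf
kept = unpruned_error([(1, 1), (4, 4)])   # node kept, with two pure leaves
print(pruned, kept, pruned > kept)
```

Exact fractions are used so that the comparison between 2/7 and 1/5 is not affected by rounding.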

Mingers (1989b) presented empirical comparisons of the
above five post-pruning methods across several domains. He
used four selection measures (the G-statistic, the G-statistic
with Marshall's correction, the probability of G from the
Chi-square distribution, and the Gain-ratio measure) to
investigate the interaction between the construction and
pruning methods. The achievable accuracy differs markedly
between domains, depending on their inherent uncertainty. He
found that, for most domains, pruning improved the accuracy by
20% to 25%. The results show that three methods
(error-complexity, critical value and reduced error) perform well,
while the other two may cause problems in terms of prediction
accuracy. Minimum-error pruning produces the largest trees,
and error-complexity and critical value produce the smallest
decision trees. He also shows that there is no significant
interaction between the construction and pruning methods.
Quinlan (1987a) assessed the performance of three pruning methods
(error-complexity, reduced error, and pessimistic pruning) in
terms of the clarity and accuracy of the pruned tree. He used
six test domains, including both real world tasks and
synthetic tasks. The results show that error-complexity
pruning tends to produce smaller decision trees than either
reduced error or pessimistic pruning. While all the pruned
trees had superior or equivalent accuracy compared to the
unpruned tree, the reduced error method showed slight
superiority. (For other empirical results, see Niblett and
Bratko 1986, and Clark and Niblett 1987.)

Fisher and Schlimmer (1988) suggest that the benefits of
pruning vary with the amount of training and with the
statistical dependence of the concept members on the defining
attributes.


2.3  Focus  of  Research 

While  these  pruning  methods  have  reported  excellent 
empirical  results  in  terms  of  improving  the  accuracy  of  the 
learned  concepts,  those  results  depended  heavily  on  the 
specific  training  set  and  the  domains  over  which  they  applied. 
Whether  these  results  hold  in  general  and  to  what  extent 
pruning  improves  a concept  are  unknown. 

In this study we focus on various theoretical aspects of
pruning. In Chapter 3 we propose a particular type of pruning
and determine its theoretical effect. This is combined with
recent results in learning theory to produce a pruning method
that will yield a concept that is probably approximately
correct (PAC). In Chapter 4 we focus on the accuracy of a
concept learned with pruning. In Chapter 5 we present
conditions under which pruning is useful and investigate the
reason why pruning problems arise. Finally, in Chapter 6 we
summarize our results and suggest areas for future research.


CHAPTER  3 

PAC  LEARNING  A DECISION  TREE  WITH  PRUNING 


3.1 Introduction


In this chapter we focus on a particular type of pruning
and determine its theoretical effect. This is combined with
recent results in learning theory to produce a pruning method
that will yield a concept that is probably approximately
correct (PAC).

In Section 3.2 we present notation and give background
material for constructing a decision tree with minimum rank.
A pruning algorithm is motivated and presented in Section 3.3.
In Section 3.4 we determine a bound on the error due to
pruning. In Section 3.5 we give conditions for PAC
identification with pruning. In Section 3.6 we discuss
additional pruning rules. Finally, in Section 3.7 we
summarize our results and suggest areas for future research.

Many algorithms are known for determining decision trees
(Quinlan 1986; Cheng et al. 1988; Mingers 1989a; Utgoff 1989).
Recently, Ehrenfeucht and Haussler (1988) presented the first
PAC learning algorithm for binary decision trees of rank at
most r (rank is defined below). For any fixed rank r, the
number of random examples and computation time required by
this algorithm are polynomial in the number of attributes and
linear in 1/ε and log(1/δ).

We take the rank as a conciseness measure of a decision
tree and give a pruning algorithm which gives the least
upperbound of a pruning error. We also determine the error
introduced by pruning and give the required sample size to
guarantee PAC-identification with accuracy parameter ε and
confidence parameter δ.

3.2  Binary  Decision  Trees 


3.2.1  Definitions 

We  begin  with  formal  definitions  of  binary  decision 
trees,  their  rank,  the  functions  they  represent  and  their 
error  in  an  instance  space.  Our  notation  is  similar  to  that 
given  by  Ehrenfeucht  and  Haussler  (1988). 

Definition 3.0 (A reduced binary decision tree): Let V_n = {v_1,
..., v_n} be a set of n Boolean variables. Let X_n = {0,1}^n. A
binary decision tree is defined as follows:

(i) If Q is a tree with only a root labelled either 0 or 1,
then Q is a binary decision tree over V_n. (Below we abbreviate
this case by saying "Q=0" or "Q=1".)

(ii) Let Q_0 be the left subtree of Q, and let Q_1 be the
right subtree of Q. If the root node v of Q is in V_n and the
left subtree Q_0 (a 0-subtree) and right subtree Q_1 (a
1-subtree) are binary decision trees, then Q is also a binary
decision tree.

We  define  an  internal  node  i of  a decision  tree  Q as  a 
node  in  Q which  has  left  and  right  subtrees.  All  nodes  which 
are  not  internal  nodes  are  called  leaf  nodes  or  simply  leaves 
of  a decision  tree  Q.  We  say  that  an  internal  node  i is 
informative  if  it  has  only  leaves  and  the  leaves  have 
different  labels. 

We  say  that  a decision  tree  is  reduced  if  each  variable 
appears  at  most  once  in  any  path  from  the  root  to  a leaf. 

The  level  of  node  i is  the  number  of  predecessor  nodes 
from  the  root.  The  height  of  a decision  tree  Q is  defined  as 
the  maximum  of  the  levels  of  all  leaf  nodes  of  Q. 

A fully  labeled  tree  is  a tree  for  which  every  leaf  has 
a 0 or  1 label. 

A complete  binary  tree  is  a binary  decision  tree  where 
every  leaf  is  at  the  same  level. 

Definition 3.1 (A function and rank of a decision tree): A
fully labeled binary decision tree Q represents a Boolean
function f_Q defined as follows:

(i) If Q = 0 then f_Q is the constant function 0 and if Q =
1 then f_Q is the constant function 1.

(ii) Else if v_i is the label of the root of Q, then for any
point x = (a_1, ..., a_n) ∈ X_n, if a_i = 0 then f_Q(x) = f_{Q_0}(x),
else f_Q(x) = f_{Q_1}(x).

The rank of a decision tree Q, denoted r(Q), is defined
as follows:

(i) If Q=0 or Q=1 then r(Q)=0.

(ii) Else if r_0 is the rank of the 0-subtree of Q and r_1 is
the rank of the 1-subtree, then

r(Q) = max(r_0, r_1)          if r_0 ≠ r_1,
r(Q) = r_0 + 1 (= r_1 + 1)    otherwise.

Let T_n^r be the set of all binary decision trees over V_n of
rank at most r and let F_n^r be the set of Boolean functions on
X_n that are represented by trees in T_n^r.
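
The two recursive definitions above translate almost directly into code. The following is a minimal sketch under an assumed representation that is not part of the original text: a leaf is the integer 0 or 1, an internal node is a triple (i, Q0, Q1), and variables are indexed from 0 rather than from 1.

```python
def evaluate(q, x):
    """Return f_Q(x): follow the 0-subtree when a_i = 0, else the 1-subtree."""
    while not isinstance(q, int):
        i, q0, q1 = q
        q = q1 if x[i] else q0
    return q

def rank(q):
    """Rank of a tree: 0 for a leaf, max of unequal subtree ranks,
    or the common subtree rank plus one."""
    if isinstance(q, int):
        return 0
    _, q0, q1 = q
    r0, r1 = rank(q0), rank(q1)
    return max(r0, r1) if r0 != r1 else r0 + 1

chain = (0, 0, (1, 0, (2, 0, 1)))     # conjunction of three variables
full = (0, (1, 0, 1), (1, 1, 0))      # exclusive-or, a complete tree of height 2
print(rank(chain), rank(full), evaluate(full, (1, 0)))
```

The chain-shaped tree has rank 1 regardless of its depth, while the complete tree of height 2 has rank 2, which is why rank serves as a conciseness measure distinct from height.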

The  error  of  a decision  tree  Q is  defined  below. 

Definition 3.2: Let f be the target concept. The error of a
decision tree Q, denoted e(Q) or e when Q is understood, is
defined as the probability of all x such that f(x) ≠ f_Q(x).
That is,

e(Q) = Prob { x : f(x) ≠ f_Q(x) }.

3.2.2 Finding Consistent Binary Decision Trees with Minimum
Rank

There are several algorithms for finding a decision tree
using a training set S (see Quinlan 1986, Cheng et al. 1988,
Mingers 1989a, Spangler et al. 1989 and Utgoff 1989). Here
we focus on a PAC-learning method given by Ehrenfeucht and
Haussler (1988).




Definition 3.3: An example of a Boolean function f on X_n is a
pair (x,f(x)), where x ∈ X_n. The example is positive if
f(x)=1, else it is negative. A sample, S, of f is a set of
examples of f. |S| denotes the number of examples in S. A
decision tree Q over V_n is consistent with a sample S if for
any example (x,f(x)) in S, f(x)=f_Q(x). The rank of a sample
S, denoted r(S), is the minimum rank of any decision tree that
is consistent with S. We say a variable v in V_n is
informative (on S) if sample S contains at least one example
in which the label of v is 0 and at least one example in which
the label of v is 1. Let S_0^v be the set of all examples
(x,f(x)) such that the label of v is 0, and let S_1^v be the set
of all examples (x,f(x)) such that the label of v is 1.


The  following  two  algorithms  can  be  used  to  determine  a 
decision  tree  consistent  with  a sample  S. 

Definition  3.4:  The  procedure  Find(S,k)  is  reproduced  from 
Ehrenfeucht  and  Haussler  (1988). 

Procedure Find(S,k)

Input: A nonempty sample S of some Boolean function on X_n and
an integer k, n≥k≥0.

Output: A binary decision tree of rank at most k that is
consistent with S if one exists, else "none".

1. If all examples in S are positive, stop and return the
decision tree Q = 1; if all examples are negative, stop and
return Q = 0.

2. If k = 0, stop and return "none".

3. For each informative variable v in V_n

   a. Let Q_0^v = Find(S_0^v, k-1) and Q_1^v = Find(S_1^v, k-1).

   b. If both recursive calls are successful (i.e., neither
   Q_0^v = none, nor Q_1^v = none) then stop and return the
   decision tree with root labeled v, 0-subtree Q_0^v and
   1-subtree Q_1^v.

   c. If one recursive call is successful but the other is
   not, then

      i) Re-execute the unsuccessful recursive call with
      rank bound k instead of k-1 (i.e., if Q_1^v is a tree
      but Q_0^v = none then let Q_0^v = Find(S_0^v,k)).

      ii) If the re-executed call is now successful, then
      let Q be the decision tree with root labeled v,
      0-subtree Q_0^v and 1-subtree Q_1^v, else let Q = "none".

      iii) Stop and return Q.

4. Stop and return "none".

Definition  3.5:  The  procedure  Findmin(S)  is  reproduced  from 
Ehrenfeucht  and  Haussler  (1988). 

Procedure Findmin(S)

Input: A nonempty sample S of some Boolean function on X_n.

Output: A minimal rank reduced binary decision tree of S.

1. Repeat Find(S,k) for k = 0,1,2,... until a decision tree Q
is returned.

2. Stop and return Q.

Findmin()'s performance is given below.

Theorem 3.6 (A decision tree with minimum rank: Ehrenfeucht
and Haussler, 1988): Given a sample S of a Boolean function on
X_n, using Findmin(S) we can produce a decision tree that is
consistent with S and has rank r(S) in time O(|S|(n+1)^{2r(S)}).

We will use Findmin() in our approach.
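
The steps of Find() and Findmin() can be sketched as follows. This is an illustrative rendering rather than the authors' code: it assumes a sample is a list of (x, label) pairs with x a 0/1 tuple, uses 0-based variable indices, represents a tree as a leaf 0/1 or a triple (i, Q0, Q1), and returns None in place of "none". The bound k ≤ n in findmin is an added safeguard for samples that contain contradictory examples.

```python
def find(S, k, n):
    """Return a tree of rank <= k consistent with S, or None if none is found."""
    labels = {y for _, y in S}
    if labels == {1}:
        return 1                          # step 1: all examples positive
    if labels == {0}:
        return 0                          # step 1: all examples negative
    if k == 0:
        return None                       # step 2
    for v in range(n):                    # step 3
        S0 = [(x, y) for x, y in S if x[v] == 0]
        S1 = [(x, y) for x, y in S if x[v] == 1]
        if not S0 or not S1:
            continue                      # v is not informative on S
        q0, q1 = find(S0, k - 1, n), find(S1, k - 1, n)
        if q0 is not None and q1 is not None:
            return (v, q0, q1)            # step 3b
        if q0 is not None or q1 is not None:
            if q0 is None:                # step 3c: retry the failed side with bound k
                q0 = find(S0, k, n)
            else:
                q1 = find(S1, k, n)
            if q0 is None or q1 is None:
                return None
            return (v, q0, q1)
    return None                           # step 4

def findmin(S, n):
    """Try k = 0, 1, 2, ... and return (tree, k) for the first success."""
    for k in range(n + 1):
        q = find(S, k, n)
        if q is not None:
            return q, k
    raise ValueError("sample is not consistent with any decision tree")

xor_sample = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
and_sample = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
print(findmin(xor_sample, 2))
print(findmin(and_sample, 2))
```

Leaves are compared with `is not None` rather than by truthiness, because the leaf 0 is itself a valid tree.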

3.3 Pruning a Consistent Decision Tree to a Desired Rank

In  this  section  we  present  a pruning  algorithm  that  will 
be  used  with  Findmin()  to  give  certain  theoretical  results. 
We  begin  with  definitions  of  pruning  and  labelling. 

3.3.1  Definitions 

There  are  many  ways  to  prune  a decision  tree.  We  first 
address  what  we  mean  by  pruning. 

Definition 3.7: Pruning a decision tree Q at node v is defined
as removing the left and right subtrees of node v, and
labelling the node (now a leaf) according to some labelling
rule.

There are many possible labelling rules. For most of
this chapter we focus on deterministic methods for labelling.
Later, we will look at two nondeterministic rules.


Definition 3.8 (Deterministic Labelling): Let i be an internal
node in a decision tree Q, and Q(i) be a subtree of Q such
that node i is the root of the tree Q(i).

1. Sample labelling:

Let s(i) be a subset of the sample S used to construct Q
from its root to node i.

Let s_0(i) denote the number of negative examples in s(i)
and s_1(i) denote the number of positive examples in s(i). If
we prune Q at i, then

(i) If s_0(i) > s_1(i), label i as 0.

(ii) If s_0(i) < s_1(i), label i as 1.

2. Tree labelling:

Let v_{i_1}, ..., v_{i_p} be the nodes in the path from the root to
the parent of node i in Q, and let a_{i_1}, ..., a_{i_p} be the labels
of each node, where i_1, ..., i_p are p distinct indices of
{1, ..., n}.

Let l_0(i) = |{x : f_Q(x)=0}|, where x ∈ X_n such that
a_{i_1}, ..., a_{i_p} are the same.

Let l_1(i) = |{x : f_Q(x)=1}|, where x ∈ X_n such that
a_{i_1}, ..., a_{i_p} are the same. If we prune Q at i, then

(i) If l_0(i) > l_1(i), label i as 0.

(ii) If l_0(i) < l_1(i), label i as 1.

Below  we  give  an  illustrative  example  where  the  above  two 
methods  give  a different  labelling. 

Consider  the  case  of  an  instance  space  over  two  Boolean 
variables.  Further  suppose  we  have  five  training  examples. 
In  Table  3.1  below,  n()  denotes  the  number  of  examples  for 
each  point  x.  fQ()  denotes  the  boolean  function  of  the 
concept  learned  by  constructing  a consistent  decision  tree 
from  these  training  examples. 

Table 3.1: A Decision Tree with One Node

x        n(x)    f_Q(x)
(1,1)    1       1
(1,0)    0       1
(0,1)    1       1
(0,0)    3       0

Suppose we prune Q at the root. If we label the root
node by the sample labelling rule, the label of the root is 0
since s_0() = 3 > s_1() = 2. However, if we use the tree
labelling rule, the label of the root will be 1 since
l_0() = 1 < l_1() = 3.
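
The counts in this example can be reproduced with a few lines. The sketch below only restates Table 3.1 as a dictionary; the helper name `majority` is illustrative, and ties are deliberately left unresolved because the two rules above do not specify them.

```python
# Table 3.1: x -> (n(x), f_Q(x)). Q is consistent with the sample, so every
# training example at x carries the label f_Q(x).
table = {(1, 1): (1, 1), (1, 0): (0, 1), (0, 1): (1, 1), (0, 0): (3, 0)}

def majority(zeros, ones):
    if zeros > ones:
        return 0
    if ones > zeros:
        return 1
    return None  # tie: not covered by the two rules

# Sample labelling counts training examples; tree labelling counts instances.
s0 = sum(n for n, fq in table.values() if fq == 0)
s1 = sum(n for n, fq in table.values() if fq == 1)
l0 = sum(1 for _, fq in table.values() if fq == 0)
l1 = sum(1 for _, fq in table.values() if fq == 1)
print(majority(s0, s1), majority(l0, l1))
```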




3.3.2  Pruning 

We now consider the following problem: Given a
consistent decision tree with rank k, produce a pruned
decision tree with rank r (k > r). We want a pruned tree to
have the least amount of error due to pruning. We present an
algorithm that solves this problem in time O(2^{k+r} + |S|) using
sample labelling and O(2^{k+r} + (en/k)^k) using tree labelling.

For any decision tree Q of rank k>r, it is easily seen
that one of the following mutually exclusive cases must hold.

Case 1: r(Q_0) = k and r(Q_1) < r

Case 2: r(Q_0) = k and r ≤ r(Q_1) < k

Case 3: r(Q_0) < r and r(Q_1) = k

Case 4: r ≤ r(Q_0) < k and r(Q_1) = k

Case 5: r(Q_0) = r(Q_1) = k-1

For a given decision tree of rank k, there may be many
ways to prune it to rank r. Since our objective is to
minimize the pruning error, one possible strategy is to prune
the least number of nodes of a consistent decision tree Q. In
other words, we will prune Q to the largest subtree having the
desired rank. The idea of our pruning algorithm is as
follows. In Cases 1 and 3, pruning only the subtree with rank
k to rank r is enough to form a pruned decision tree with rank
r. Pruning the other subtree, which has rank less than r, to a
lower rank is not needed since the resulting tree will still
have the same rank. Similarly, using this minimum node pruning
rule we have two possible alternative ways of pruning in
Cases 2 and 4.

(i) Prune one subtree with rank k to rank r-1 and prune the
other subtree to rank r.

(ii) Prune one subtree with rank k to rank r and prune the
other subtree to rank r-1.

In the algorithm shown below, we use method (i) and in
Section 3.4, we show that this method gives the least pruning
error.

Finally, in Case 5, we will arbitrarily prune the tree by
giving the preference to the 0-subtree.

3.3.3  A Pruning  Algorithm 

Below is the pruning algorithm Prune(). This algorithm can
prune any binary tree. However, we restrict our initial input
to trees produced by Findmin(S) in order to later guarantee
PAC identification. In the following, Q ← P means Q is
replaced by P.

Procedure Prune(r,k,Q,S)

Input: A decision tree Q with rank at most k, a sample S
and an integer r such that 0≤r<k.

Output: A decision tree with rank at most r.

Let s_0(Q,S) = the number of examples in S such that f_Q(x)=0.
Let s_1(Q,S) = the number of examples in S such that f_Q(x)=1.
Let l_0(Q) = the number of instances x ∈ X_n such that f_Q(x)=0.
Let l_1(Q) = the number of instances x ∈ X_n such that f_Q(x)=1.

1. If r(Q) ≤ r, then stop and return Q.

2. If r=0, then stop and label Q by the appropriate labelling
rule:

   1) Sample labelling:
      Q=0 if s_0(Q,S) > s_1(Q,S), otherwise Q=1.

   2) Tree labelling:
      Q=0 if l_0(Q) > l_1(Q), otherwise Q=1.

   Return Q.

3. Let k_0 = r(Q_0), k_1 = r(Q_1).

Let S_0 be the set of all examples (x,f(x)) in S such that
x = (a_1, ..., a_n) and a_i = 0, where v_i ∈ V_n is the root of Q.
Let S_1 be the set of all examples (x,f(x)) in S such that
x = (a_1, ..., a_n) and a_i = 1, where v_i ∈ V_n is the root of Q.

Case 1: If k_0 = k and k_1 < r
        then Q_0 ← Prune(r,k_0,Q_0,S_0).

Case 2: If k_0 = k and r ≤ k_1 < k
        then Q_1 ← Prune(r,k_1,Q_1,S_1), Q_0 ← Prune(r-1,k_0,Q_0,S_0).

Case 3: If k_0 < r and k_1 = k
        then Q_1 ← Prune(r,k_1,Q_1,S_1).

Case 4: If r ≤ k_0 < k and k_1 = k
        then Q_0 ← Prune(r,k_0,Q_0,S_0), Q_1 ← Prune(r-1,k_1,Q_1,S_1).

Case 5: If k_0 = k_1 = k-1
        then Q_0 ← Prune(r,k_0,Q_0,S_0), Q_1 ← Prune(r-1,k_1,Q_1,S_1).

4. Stop and return Q.
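
A compact rendering of Prune() with the sample labelling rule is sketched below. Several choices here are assumptions made for illustration: trees are leaves 0/1 or triples (i, Q0, Q1) with 0-based variable indices, the argument k is recomputed as the rank of the current tree rather than passed in, example labels stand in for f_Q(x) (valid when Q is consistent with S), and Cases 2 and 4 are taken to include a sibling subtree of rank exactly r.

```python
def rank(q):
    if isinstance(q, int):
        return 0
    _, q0, q1 = q
    r0, r1 = rank(q0), rank(q1)
    return max(r0, r1) if r0 != r1 else r0 + 1

def prune(r, q, S):
    """Prune tree q to rank at most r; S is a list of (x, label) pairs."""
    k = rank(q)
    if k <= r:                               # step 1: nothing to prune
        return q
    if r == 0:                               # step 2: collapse to a labelled leaf
        s0 = sum(1 for _, y in S if y == 0)
        s1 = len(S) - s0
        return 0 if s0 > s1 else 1
    i, q0, q1 = q                            # step 3: split the sample at the root
    S0 = [(x, y) for x, y in S if x[i] == 0]
    S1 = [(x, y) for x, y in S if x[i] == 1]
    k0, k1 = rank(q0), rank(q1)
    if k0 == k and k1 < r:                   # case 1
        q0 = prune(r, q0, S0)
    elif k0 == k:                            # case 2: r <= k1 < k
        q1 = prune(r, q1, S1)
        q0 = prune(r - 1, q0, S0)
    elif k1 == k and k0 < r:                 # case 3
        q1 = prune(r, q1, S1)
    elif k1 == k:                            # case 4: r <= k0 < k
        q0 = prune(r, q0, S0)
        q1 = prune(r - 1, q1, S1)
    else:                                    # case 5: k0 == k1 == k - 1
        q0 = prune(r, q0, S0)
        q1 = prune(r - 1, q1, S1)
    return (i, q0, q1)

xor_tree = (0, (1, 0, 1), (1, 1, 0))
xor_sample = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
pruned = prune(1, xor_tree, xor_sample)
print(pruned, rank(pruned))
```

In this small example Case 5 applies at the root, so the 0-subtree is kept at rank 1 and the 1-subtree is collapsed to a single leaf.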




To obtain our theoretical results it will be useful to
have another pruning algorithm. This alternative pruning
method prunes one subtree of rank k to rank r and prunes the
other subtree to rank r-1. We define W-Prune() as this
alternative pruning algorithm.

Procedure W-Prune(r,k,Q,S)

Input: A decision tree Q with rank at most k, a sample S
and an integer r such that 0≤r<k.

Output: A decision tree with rank at most r.

Let s_0(Q,S) = the number of examples in S such that f_Q(x)=0.
Let s_1(Q,S) = the number of examples in S such that f_Q(x)=1.
Let l_0(Q) = the number of instances x ∈ X_n such that f_Q(x)=0.
Let l_1(Q) = the number of instances x ∈ X_n such that f_Q(x)=1.

1. If r(Q) ≤ r, then stop and return Q.

2. If r=0, then stop and label Q by the appropriate labelling
rule:

   1) Sample labelling:
      Q=0 if s_0(Q,S) > s_1(Q,S), otherwise Q=1.

   2) Tree labelling:
      Q=0 if l_0(Q) > l_1(Q), otherwise Q=1.

   Return Q.

3. Let k_0 = r(Q_0), k_1 = r(Q_1).

Let S_0 be the set of all examples (x,f(x)) in S such that
x = (a_1, ..., a_n) and a_i = 0, where v_i ∈ V_n is the root of Q.
Let S_1 be the set of all examples (x,f(x)) in S such that
x = (a_1, ..., a_n) and a_i = 1, where v_i ∈ V_n is the root of Q.

Case 1: If k_0 = k and k_1 < r
        then Q_0 ← W-Prune(r,k_0,Q_0,S_0).

Case 2: If k_0 = k and r ≤ k_1 < k
        then Q_1 ← W-Prune(r-1,k_1,Q_1,S_1), Q_0 ← W-Prune(r,k_0,Q_0,S_0).

Case 3: If k_0 < r and k_1 = k
        then Q_1 ← W-Prune(r,k_1,Q_1,S_1).

Case 4: If r ≤ k_0 < k and k_1 = k
        then Q_0 ← W-Prune(r-1,k_0,Q_0,S_0), Q_1 ← W-Prune(r,k_1,Q_1,S_1).

Case 5: If k_0 = k_1 = k-1
        then Q_0 ← W-Prune(r,k_0,Q_0,S_0), Q_1 ← W-Prune(r-1,k_1,Q_1,S_1).

4. Stop and return Q.

Lemma 3.9 shows that Prune(r,k,Q,S) and W-Prune(r,k,Q,S)
always give a pruned tree with the desired rank at most r.

Lemma 3.9: The procedures Prune(r,k,Q,S) and W-Prune(r,k,Q,S)
are correct.

Proof: The proof is by induction on r. If r=0 and k>0 then
Prune() and W-Prune() return Q=0 or Q=1. Hence, r(Q) = 0 = r.

Now suppose Prune() and W-Prune() run correctly on all
cases requiring rank r' ≤ r-1 and k > 0. In each of the five
cases treated within Prune() and W-Prune(), either a tree of
rank less than or equal to r-1 is to be produced, or a tree of
rank at most r is to be produced. By assumption, for those
requiring rank less than or equal to r-1, the algorithm will
work correctly. So we turn our attention to the case
requiring rank at most r.

Without loss of generality, suppose Q_0 requires rank at
most r. Prune(r,k_0,Q_0,S_0) and W-Prune(r,k_0,Q_0,S_0) will operate
on a tree with one fewer variable than Q. Since the tree is
reduced, the number of variables used in Q_0 from its root to
its leaves must be less than or equal to n-1. Since only one
subprocedure requires a tree with rank at most r in each
recursion, we keep dividing that subprocedure into subprocedures
until the one requiring rank at most r operates on a tree with
rank less than or equal to r. So, ultimately in the recursion,
all subprocedures except one require trees with rank less than
or equal to r-1, and the subprocedure requiring a tree with
rank at most r operates on a tree with rank less than or equal
to r, since the number of variables used in the tree is reduced
enough to guarantee that the rank of the tree is less than or
equal to r. If r(Q) ≤ r, by Step 1, Prune() and W-Prune()
correctly return a tree with rank at most r. Since Prune() and
W-Prune() are correct for all other subprocedures requiring rank
less than or equal to r-1, by induction, the procedures
Prune(r,k,Q,S) and W-Prune(r,k,Q,S) return a pruned tree with
rank at most r.


Therefore, the procedures Prune(r,k,Q,S) and
W-Prune(r,k,Q,S) are correct for all 0≤r<k.

□


Lemma 3.10 will be used to prove the time complexity of
the procedure Prune().

Lemma 3.10 (An upperbound on the number of nodes in a reduced
decision tree over V_n: Ehrenfeucht and Haussler, 1988): Let j
be the number of nodes in a reduced decision tree over V_n of
rank k, where n≥k≥1. Then

j < 2(en/k)^k,

where e is the base of the natural logarithm.

Lemma 3.11: For any nonempty tree Q with rank k>0, the time of
Prune(r,k,Q,S) is O(2^{k+r} + |S|) for sample labelling, and
O(2^{k+r} + (en/k)^k) for tree labelling.

Proof: Prune() will stop either when r=0 or when k reduces to
r. For each call of Prune(), r or k will be reduced by at
least one, and at most two new recursive calls of Prune() are
made. Since we do not have subprocedures if r+k < 2, there
are at most r+k-2 steps. So, the total number of calls of
Prune() will be at most 2^{k+r-1}. For each call of Prune(), there
are a constant number of unit operations (mainly, comparisons)
except for the labelling time. The time for the labelling
step (Step 2) depends on the labelling rule. If we label a
pruned node using sample labelling, at most |S| additional
time is needed since S_0 and S_1 in Prune() are disjoint
subsets of S and labelling occurs at most once in each
subprocedure. If we label a pruned node using tree labelling,
at most (en/k)^k additional time is needed since Q_0 and Q_1 are
disjoint subtrees of Q and labelling occurs at most once in
each subprocedure. Also, the number of leaf nodes which we
need to scan to label the node is less than or equal to
(en/k)^k by Lemma 3.10. So, the time of Prune(r,k,Q,S) is
O(2^{k+r} + |S|) for sample labelling, and O(2^{k+r} + (en/k)^k) for
tree labelling.

□

□ 

Lemma 3.11 gives the time complexity of this algorithm.
If we use the sample labelling rule, the time required for
pruning is independent of the number of variables, n, and far
less than the tree construction time of Findmin() in Theorem
3.6 since n > 1, k > r and |S| > 1. If we use the tree
labelling rule, the time required for pruning is polynomial in
n for fixed k and less than the time of Findmin() since |S| > 1.




3.4  Determining  the  Pruning  Error 

To  determine  a PAC  learning  result  for  pruning,  we  need 
to  bound  the  pruning  error.  Here  we  give  a formal  definition 
of  the  pruning  error  and  the  least  upperbound  of  the  pruning 
error  for  a given  rank  tree.  We  can  define  the  pruning  error 
over  the  training  sample  or  over  the  whole  instance  space. 
Here  we  use  the  whole  instance  space.  The  pruning  error  over 
the  training  sample  will  always  be  positive  since  we  are 
assuming  that  the  starting  decision  tree  is  reduced  and  is 
consistent with the training sample. However, if we define
the pruning error over the instance space, we may get
different results since the pruned tree may perform better
than the unpruned tree as shown in many studies (Quinlan
1987a; Mingers 1989b).

Definition 3.12: Given a decision tree P(Q) formed by pruning
a decision tree Q produced by Findmin(), the pruning error
e(Q,P(Q)) is defined as the difference between the error of
P(Q) and the error of Q.

e(Q,P(Q)) = e(P(Q)) - e(Q)
          = Prob { x : f(x) ≠ f_{P(Q)}(x) } - Prob { x : f(x) ≠ f_Q(x) }.

For an arbitrary probability distribution D and any
decision tree Q in which all nodes are informative, we can
easily verify that the pruning error is in the range (-1,1)
since, by definition, the pruning error is the difference of
two probabilities, where e(Q) is in the range [0,1) and
e(P(Q)) is in the range (0,1). Since we assume that all
examples are drawn according to D without error and Findmin(S)
guarantees that all internal nodes having only leaves in Q are
informative, e(P(Q)) must be positive. e(P(Q)) cannot be 1
since P(Q) cannot be the null tree. So the pruning error
cannot reach boundary points -1 or 1. Below we show that both
positive and negative cases of e(Q,P(Q)) can be realized.


3.4.1 Possible Cases of the Pruning Error

Consider the case of an instance space over three Boolean
variables. Further suppose we have a target concept f() and
probability distribution D() as shown in Tables 3.2 and 3.3
below. Suppose we take 10 training examples under D(). In
Tables 3.2 and 3.3 below, n() denotes the number of examples
for each point x. f_Q() denotes the boolean function of the
concept learned by constructing a consistent decision tree
from these training examples. By pruning the decision tree Q,
we get a pruned decision tree P(Q). f_{P(Q)}() denotes the
boolean function of the concept learned from the pruned
decision tree. We label each pruned node by the sample
labelling rule.




Table 3.2: Positive error case

x          f(x)    D(x)    n(x)    f_Q(x)    f_{P(Q)}(x)
(1,1,1)    0       .001    0       1         1
(1,1,0)    1       .001    3       1         1
(1,0,1)    1       .001    0       0         0
(1,0,0)    0       .001    2       0         0
(0,1,1)    0       .001    2       0         0
(0,1,0)    1       .001    0       0         0
(0,0,1)    1       .993    2       1         0
(0,0,0)    0       .001    1       0         0

In this decision tree, r(Q) = 2 and r(P(Q)) = 1.

The pruning error for this case is

Prob { x : f(x) ≠ f_{P(Q)}(x) } - Prob { x : f(x) ≠ f_Q(x) }

= (.001 + .001 + .001 + .993) - (.001 + .001 + .001)

= .993

Table 3.3: Negative error case

x          f(x)    D(x)    n(x)    f_Q(x)    f_{P(Q)}(x)
(1,1,1)    0       .001    0       1         1
(1,1,0)    1       .001    3       1         1
(1,0,1)    1       .001    0       0         0
(1,0,0)    0       .001    2       0         0
(0,1,1)    0       .001    1       0         1
(0,1,0)    1       .993    0       0         1
(0,0,1)    1       .001    3       1         1
(0,0,0)    0       .001    1       0         1

In this decision tree, r(Q) = 2 and r(P(Q)) = 1.

The pruning error for this case is

Prob { x : f(x) ≠ f_{P(Q)}(x) } - Prob { x : f(x) ≠ f_Q(x) }

= (.001 + .001 + .001 + .001) - (.001 + .001 + .993)

= -.991
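
Both pruning errors can be recomputed directly from the tabulated values. The sketch below copies the rows of Tables 3.2 and 3.3 in order, keeping only f(x), D(x), f_Q(x) and f_{P(Q)}(x); the names `TABLE_3_2`, `TABLE_3_3` and `pruning_error` are illustrative, and exact fractions are used so the results are not affected by floating-point rounding.

```python
from fractions import Fraction

def d(thousandths):
    """Probability given in thousandths, e.g. d(993) is .993."""
    return Fraction(thousandths, 1000)

# Each row: (f(x), D(x), f_Q(x), f_P(Q)(x)), in the row order of the tables.
TABLE_3_2 = [
    (0, d(1), 1, 1), (1, d(1), 1, 1), (1, d(1), 0, 0), (0, d(1), 0, 0),
    (0, d(1), 0, 0), (1, d(1), 0, 0), (1, d(993), 1, 0), (0, d(1), 0, 0),
]
TABLE_3_3 = [
    (0, d(1), 1, 1), (1, d(1), 1, 1), (1, d(1), 0, 0), (0, d(1), 0, 0),
    (0, d(1), 0, 1), (1, d(993), 0, 1), (1, d(1), 1, 1), (0, d(1), 0, 1),
]

def pruning_error(rows):
    """e(P(Q)) - e(Q): probability mass misclassified by P(Q) minus that by Q."""
    e_q = sum(p for f, p, fq, _ in rows if f != fq)
    e_p = sum(p for f, p, _, fp in rows if f != fp)
    return e_p - e_q

print(pruning_error(TABLE_3_2), pruning_error(TABLE_3_3))
```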


3.4.2 The Least Upperbound of the Pruning Error

Here  we  develop  a series  of  lemmas  which  we  need  to 
determine  the  least  upperbound  of  the  pruning  error.  This  is 
used  to  guarantee  PAC  identification  of  a learning  algorithm. 
We  begin  with  a formal  definition  of  the  least  upperbound  of 
the  pruning  error. 

Let W_n^r denote the set of all fully-labeled, reduced,
binary trees of rank r over V_n, in which all variables are
informative and all internal nodes which have only leaves are
informative. Since we are using Findmin(), which finds a
decision tree Q ∈ W_n^r, we need only consider Q ∈ W_n^r. However,
much of the following analysis can be directly applied to
other situations.

Definition 3.13: The least upperbound of the pruning error,
denoted η_{k,r}, for given rank k and target r, is defined as
follows:

Let η_{k,r} = sup_{Q ∈ W_n^k} inf_{P ∈ p(Q)} e(Q,P).

For any Q ∈ W_n^k, p(Q) is the set of all subtrees of Q of rank
r pruned from Q.



For  convenience,  we  define  r?r  s rjr  rn. 

From earlier discussions, we know that $\eta_{k,r}$ can be equal
to 1 for an arbitrary distribution D. Since the population
distribution is usually unknown in real-world situations, it is
difficult to characterize $\eta_{k,r}$ further. So, without some
restrictions on the distribution D, we cannot improve or
characterize the upperbound of the pruning error. However, for
totally unknown population distributions, we may assume an equally
likely distribution (i.e., a uniform distribution), or we may
assume that the sampling distribution is equal to the population
distribution. Here we start with the assumption of an equally
likely distribution D to determine the error of pruning. The
pruning error also depends on the labelling rule. Here we use the
tree labelling rule. As will be shown, these assumptions guarantee
that the least upperbound of the pruning error is positive and
less than or equal to 0.5. In the following we give several
results characterizing $\eta_{k,r}$ under these assumptions.

To facilitate our effort to determine $\eta_{k,r}$, we will need
to define a Maximal tree. Let Q(j) be a subtree of Q with an
internal node j of Q as its root node.

Definition 3.14: A Maximal tree of a reduced decision tree Q
with rank r for given n, $n \ge r$, is defined as follows:

1) If n = r, then Q is a complete binary tree with height n.

2) If n > r, then for each internal node j,

   i) if the level of node j is less than n - r(Q(j)), then
      one of the subtrees of Q(j) has rank r(Q(j)) and the
      other subtree has rank r(Q(j)) - 1; or

   ii) if the level of node j = n - r(Q(j)), then Q(j) is a
       complete binary tree with height r(Q(j)).

Let $M_n^r$ denote the set of all maximal trees of rank r over $V_n$.

Based on the above definition, it can be easily seen that
the number of nodes is equal for all maximal trees of the
same rank for given n, and that the number of nodes in a maximal
tree increases as the rank increases, since a maximal tree
$Q \in M_n^r$ can always be found by deleting at least one node from
a higher rank maximal tree $Q' \in M_n^{r+1}$.

Under the assumptions of a uniform distribution and the
tree labelling rule, Lemma 3.15 and Lemma 3.16 characterize
$\eta_{k,r}$.

Lemma 3.15: When D is a uniform distribution and Prune() uses
the tree labelling rule and is given $Q \in W_n^k$ as input, the
least upperbound of the pruning error, $\eta_{k,r}$, is strictly
positive and less than 0.5 when r > 0. $\eta_{k,r} = 0.5$ only when
r = 0.

Proof: Let $Q \in W_n^k$. Suppose some internal node "i" in Q is at
level m (m < n). Further suppose that we prune Q to P(Q) at
node "i", leaving "i" a leaf node in P(Q). Let p("i") be the
probability of instances covered by the leaves of the subtree
Q(i) in Q. Under a uniform distribution,
p("i") = $2^{n-m}/2^n = (0.5)^m$. Let e("i") be the probability of
incorrect classifications in the instances covered by the leaves
of subtree Q(i) over the whole instance space.

By the definition of $l_0(i)$ and $l_1(i)$ in Definition 3.8,
$l_0(i) + l_1(i) = 2^{n-m}$. Suppose $\alpha_0(i)$ is the number of
incorrect classifications of $l_0(i)$ and $\alpha_1(i)$ is the number
of incorrect classifications of $l_1(i)$. Then clearly

    e("i") = $\alpha_0(i)/2^n + \alpha_1(i)/2^n$

where $0 \le \alpha_0(i) \le l_0(i)$ and $0 \le \alpha_1(i) \le l_1(i)$.

If we prune Q at node "i", then in P(Q) the label of the
leaf node "i" is either 0 or 1. If we label the node "i" by
the "tree labelling" rule, then node "i" will have the label
which covers the larger number of instances. If
$l_0(i) \ge l_1(i)$, then the label of the node i is 0 and

    e("i") = $\alpha_0(i)/2^n + l_1(i)/2^n - \alpha_1(i)/2^n$.

By the definition of the pruning error,

    e(Q,P(Q)) = e(P(Q)) - e(Q)
      $= [\alpha_0(i)/2^n + l_1(i)/2^n - \alpha_1(i)/2^n] - [\alpha_0(i)/2^n + \alpha_1(i)/2^n]$
      $= l_1(i)/2^n - 2\alpha_1(i)/2^n \le l_1(i)/2^n \le (1/2)(0.5)^m$

since $l_1(i) \le 2^{n-m-1}$.

Similarly, if $l_0(i) < l_1(i)$, then the label of the node i is
1 and

    e("i") = $\alpha_1(i)/2^n + l_0(i)/2^n - \alpha_0(i)/2^n$,

    e(Q,P) $= l_0(i)/2^n - 2\alpha_0(i)/2^n \le l_0(i)/2^n \le (1/2)(0.5)^m$

since $l_0(i) \le 2^{n-m-1}$.

In both cases, the pruning error does not exceed half of
p("i"). In other words, the leaf node "i" in P(Q) correctly
classifies at least half of the corresponding instances.
Since this is true for all pruned nodes, the pruned tree P(Q)
correctly classifies at least half of the total instances.
Therefore, the pruning error e(Q,P(Q)) $\le$ 0.5.

Further, $l_0(i)$ and $l_1(i)$ are nonzero for any node in
$Q \in W_n^k$ since all nodes in Q are informative. Hence,
e(Q,P(Q)) > 0 and $\eta_{k,r} > 0$ for all $k > r \ge 0$.

If r > 0, at least one subprocedure of Prune() stops at
step 1, since Prune() reduces k only in one of its
subprocedures and eventually k reduces to less than or equal
to r. In that subprocedure, Prune() does not prune that
subtree of Q. So the probability over the instance space for
which pruning is needed in Q is strictly less than 1. Hence,
e(Q,P(Q)) < 0.5 and $\eta_{k,r} < 0.5$.

If r = 0, then the probability over the instance space for
which pruning is needed in Q is equal to 1. Hence, e(Q,P(Q))
$\le$ 0.5, and $\eta_{k,r} = 0.5$ can only be realized when the
probability of all the negative examples and that of all the
positive examples are equal, and when Q has no error over the
instance space, for all k > r = 0.

□ 
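
The single-node bound used in the proof can be checked exhaustively on a small case. The following Python sketch is an illustration only: the function names are hypothetical, ties are assumed to be broken toward label 0, and the enumeration simply tries every admissible combination of $l_0(i)$, $l_1(i)$, $\alpha_0(i)$ and $\alpha_1(i)$ for a node at level m.

```python
from fractions import Fraction

def node_pruning_error(n, l0, l1, a0, a1):
    """Increase in error when the subtree at a node is replaced by one leaf
    labelled by the tree labelling rule (ties go to label 0, an assumption)."""
    before = Fraction(a0 + a1, 2 ** n)
    if l0 >= l1:   # leaf labelled 0: the correctly labelled 1-instances become errors
        after = Fraction(a0 + (l1 - a1), 2 ** n)
    else:          # leaf labelled 1: the correctly labelled 0-instances become errors
        after = Fraction(a1 + (l0 - a0), 2 ** n)
    return after - before

def worst_case(n, m):
    """Largest pruning error of a node at level m over all admissible counts."""
    covered = 2 ** (n - m)          # instances below the node
    worst = Fraction(0)
    for l0 in range(covered + 1):
        l1 = covered - l0
        for a0 in range(l0 + 1):
            for a1 in range(l1 + 1):
                worst = max(worst, node_pruning_error(n, l0, l1, a0, a1))
    return worst

n, m = 4, 1
print(worst_case(n, m), Fraction(1, 2) ** (m + 1))
```

In this small case the worst value coincides with $(1/2)(0.5)^m$, which is the bound derived above.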




Lemma 3.16: Let $Q \in W_n^k$ and $P \in p(Q)$. The following are true.

a) The pruning error, e(Q,P), is a nondecreasing function
of the amount of pruning, k-r.

b) $\eta_{k,r}$ is attained at a maximal tree P(Q) in $M_n^r$.

c) $\eta_{k,r}$ is a nonincreasing function of r.

Proof: a) Suppose that an internal node "a" in Q is at level
m (m < n) with left child node "b" and right child node "c", each
at level m+1. Under a uniform distribution, the probability
of instances covered by node "a" at level m is $(0.5)^m$. Let
e("i") be the probability of an incorrect classification in
the instances covered by the leaves of the subtree Q(i).

Suppose P(Q) and P'(Q) are two pruned trees of Q, where
P(Q) is found by pruning at nodes "b" and "c" of Q and P'(Q)
is found by pruning at node "a" of Q. Further suppose
e("b") = $\alpha$ and e("c") = $\gamma$ at level m+1. Using the same
arguments as in the proof of Lemma 3.15,
$0 \le \alpha, \gamma \le (0.5)^{m+2}$. Then

    e("a") = e("b") + e("c") = $\alpha + \gamma$

in P(Q).

But if we prune Q at node "a", the label of the leaf node
"a" is either 0 or 1. If the label of "a" is 0, then in P'(Q),

    e("a") $= \alpha + ((0.5)^{m+1} - \gamma) = (0.5)^{m+1} + \alpha - \gamma$.

So, [e("a") in P'(Q)] - [e("a") in P(Q)]

    $= ((0.5)^{m+1} + \alpha - \gamma) - (\alpha + \gamma) = (0.5)^{m+1} - 2\gamma \ge 0$

since $\gamma \le (0.5)^{m+2}$.

Similarly, if the label of "a" is 1,

[e("a") in P'(Q)] - [e("a") in P(Q)]

    $= ((0.5)^{m+1} + \gamma - \alpha) - (\alpha + \gamma) = (0.5)^{m+1} - 2\alpha \ge 0$

since $\alpha \le (0.5)^{m+2}$.

So, the error of the subtree at node "a" in P'(Q) is
greater than or equal to that of P(Q). Since all other nodes
are equal except node "a" in P(Q) and P'(Q),
e(P'(Q)) $\ge$ e(P(Q)). Hence, since e(Q) is equal in both cases,
the pruning error, e(Q,P), is a nondecreasing function of the
amount of pruning.

b) By the definition of a maximal tree, for given n and
r, any nonmaximal tree has fewer nodes than a maximal tree of
equal rank. To find a nonmaximal tree of desired rank by
pruning Q, we need to prune more nodes than the amount needed
when we find a comparable maximal tree of equal rank. So, the
least amount of pruning occurs when we prune Q to a maximal
tree P(Q) of rank r. From (a), the least upperbound of the
pruning error, $\eta_{k,r}$, is attained at a maximal tree P(Q) in
$M_n^r$.

c) From (b), $\eta_{k,r}$ occurs at a maximal tree. Since a
maximal tree with rank r+1 has more nodes than a maximal tree
with rank r for a given number of attributes n, we need to
prune more nodes when we prune a tree with rank k to a tree
with rank r than to a tree with rank r+1. So,
$\eta_{k,r+1} \le \eta_{k,r}$ for all r.

□ 




Lemma 3.17 (Characterization of $\eta_{k,r}$):

$\eta_{k,0} = 0.5$ for $k > 0$.

$\eta_{k,1} = 0.5\,(1 - (0.5)^{k-1})$ for $k \ge 1$.

$\eta_{k,2} = 0.5\,(1 - (0.5)^{k-2}) - 2(k-2)(0.5)^{k+2}$ for $k \ge 2$.

$\eta_{k,3} = 0.5\,(1 - (0.5)^{k-3}) - (k-3)(k+8)(0.5)^{k+3}$ for $k \ge 3$.

$\eta_{k,4} = 0.5\,(1 - (0.5)^{k-4}) - (k-4)\{(k^2+13k+84)/3\}(0.5)^{k+4}$
for $k \ge 4$.

$\eta_{k,r} = \sum_{m=r+1}^{k} (0.5)^{k-m+1}\,\eta_{m,r-1}$, $k > r$,

$\quad \le 0.5 - \{1 + (k-r)(0.5)^2\}(0.5)^{k-r+1}$ for $k > r > 1$.

$\eta_1 = 1/2 = 0.5$. $\eta_2 = 1/4 = 0.25$. $\eta_3 = 3/16 = 0.1875$.

$\eta_4 = 5/32 = 0.156$. $\eta_5 = 35/256 = 0.1367$. $\eta_6 = 126/1024 = 0.123$.

$\eta_7 = 462/4096 = 0.1128$. $\eta_8 = 1716/16384 = 0.1047$.

Proof: First we prove that $\eta_{k,1} = 0.5\,(1 - (0.5)^{k-1})$. To
reduce the rank of a tree to 1, the procedure Prune() will
prune each subtree with rank greater than one at each level
from level 1 to level k-1. So, by Lemma 3.15, the least
upperbound of the pruning error for an equally likely
distribution will be

    $\eta_{k,1} = (0.5)^2 + (0.5)^3 + \cdots + (0.5)^k = (0.5)\,(1 - (0.5)^{k-1})$.

Now, we prove the recursive relation. With Prune(r,k,Q,S),
pruning a maximal tree with rank k to a tree with rank r will
result in calls to two subprocedures, Prune(r-1,k,Q,S) and
Prune(r,k-1,Q,S). As the level of pruning goes down by one
level, $\eta_{k,r}$ of the lower level reduces to a half of that of
the higher level. So, the pruning errors will have the
following relationship:

    $\eta_{k,r} = (0.5)\,\eta_{k,r-1} + (0.5)\,\eta_{k-1,r}$.

If we continue branching the procedure Prune(r,k,Q,S) at each
level, we will get

    $\eta_{k,r} = (0.5)\,\eta_{k,r-1} + (0.5)\,\{(0.5)\,\eta_{k-1,r-1} + (0.5)\,\eta_{k-2,r}\}$
    $\quad = (0.5)\,\eta_{k,r-1} + (0.5)^2\,\eta_{k-1,r-1} + \cdots + (0.5)^{k-m+1}\,\eta_{m,r-1} + \cdots$

where $r+1 \le m \le k$.

Here we use mathematical induction to show the upperbound of
$\eta_{k,r}$. For r = 2, the above bound holds because

    $\eta_{k,2} = 0.5\,(1 - (0.5)^{k-2}) - 2(k-2)(0.5)^{k+2}$
    $\quad = 0.5 - (0.5)^{k-1} - (k-2)(0.5)^{k+1}$
    $\quad = 0.5 - \{1 + (k-2)(0.5)^2\}(0.5)^{k-2+1}$ for $k > r > 1$.

Suppose it is true for r = p-1, where $p \ge 3$. Then

    $+ (0.5)^{k-2}\{(0.5)\,\eta_{r+2,r-1} + (0.5)\,\eta_{r+1,r}\}$.

Since $\eta_{r+1,r} = (0.5)\,\eta_{r+1,r-1}$ in a maximal tree,

    $\le 0.5 - (0.5)^{k-p+1} - (k-p)(0.5)^{k-p+3}$.

Thus it holds for r = p. Therefore,
$\eta_{k,r} \le 0.5 - \{1 + (k-r)(0.5)^2\}(0.5)^{k-r+1}$ for $k > r > 1$.

□ 
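
The recursion and the closed forms of Lemma 3.17 can be cross-checked numerically. The sketch below is not part of the original procedure; it assumes the boundary values $\eta_{k,0} = 0.5$ for $k > 0$ and $\eta_{k,r} = 0$ for $k \le r$ (no pruning needed), and the function names are hypothetical. Exact rational arithmetic is used so that the comparisons are not affected by rounding.

```python
from fractions import Fraction
from functools import lru_cache

HALF = Fraction(1, 2)

@lru_cache(maxsize=None)
def eta(k, r):
    """eta_{k,r} from the recursion eta_{k,r} = (eta_{k,r-1} + eta_{k-1,r}) / 2."""
    if k <= r:      # assumed boundary: nothing to prune, so no pruning error
        return Fraction(0)
    if r == 0:      # eta_{k,0} = 0.5 for k > 0
        return HALF
    return HALF * eta(k, r - 1) + HALF * eta(k - 1, r)

def closed_form(k, r):
    """Closed forms stated in Lemma 3.17 for r = 1, 2, 3, 4."""
    base = HALF * (1 - HALF ** (k - r))
    correction = {
        1: Fraction(0),
        2: 2 * (k - 2) * HALF ** (k + 2),
        3: (k - 3) * (k + 8) * HALF ** (k + 3),
        4: (k - 4) * Fraction(k * k + 13 * k + 84, 3) * HALF ** (k + 4),
    }[r]
    return base - correction

def upper_bound(k, r):
    """Bound 0.5 - {1 + (k-r)(0.5)^2}(0.5)^(k-r+1), stated for k > r > 1."""
    return HALF - (1 + (k - r) * HALF ** 2) * HALF ** (k - r + 1)

# eta_r = eta_{r,r-1} for r = 1, ..., 8, printed as reduced fractions
diagonal = [eta(r, r - 1) for r in range(1, 9)]
print([str(x) for x in diagonal])
print(all(eta(k, r) == closed_form(k, r)
          for r in range(1, 5) for k in range(r, 12)))
```

Under these assumptions the values $\eta_{r,r-1}$ reproduce the fractions listed for $\eta_1$ through $\eta_8$ in the lemma.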




Lemma 3.18: Let $\eta'_{k,r}$ be the least upperbound of the pruning
error using algorithm W-Prune() and the tree labelling rule.
When D is a uniform distribution, $\eta_{k,r} \le \eta'_{k,r}$ for all
$k > r \ge 0$.

Proof: In Lemma 3.15 we assumed only that D is a uniform
distribution and that the pruning algorithm uses the tree
labelling rule. Hence, it holds for the procedure W-Prune() also.
So, when D is a uniform distribution, $\eta_{k,0} = \eta'_{k,0} = 0.5$
for all k > 0.

Now we show that $\eta'_{k,1} = 0.5\,(1 - (0.5)^{n-1})$. To reduce the
rank of a tree to 1, the procedure W-Prune() will prune each
subtree with rank greater than one at each level from level 1
to level n-1. So, by Lemma 3.15, the least upperbound of the
pruning error for an equally likely distribution will be

    $\eta'_{k,1} = (0.5)^2 + (0.5)^3 + \cdots + (0.5)^n = (0.5)\,(1 - (0.5)^{n-1})$.

Since $k \le n$, $\eta_{k,1} \le \eta'_{k,1}$ for all $k \ge 1$.

Notice that $\eta'_{k,1}$ is independent of k. So,
$\eta'_{k,1} = 0.5\,(1 - (0.5)^{n-1})$ for all $1 \le k \le n$.

Now we show the recursive relation. With W-Prune(r,k,Q,S),
pruning a maximal tree with rank k to a tree with rank r will
result in calls to two subprocedures, W-Prune(r,k,Q,S) and
W-Prune(r-1,k-1,Q,S). As the level of pruning goes down by one
level, $\eta_{k,r}$ of the lower level reduces to a half of that of
the higher level. So, the pruning errors will have the
following relationship:

    $\eta'_{k,r} = (0.5)\,\eta'_{k,r} + (0.5)\,\eta'_{k-1,r-1}$.

This can be simplified to

    $\eta'_{k,r} = \eta'_{k-1,r-1}$.

So, the following relationship holds:

    $\eta'_{k,r} = \eta'_{k-1,r-1} = \eta'_{k-2,r-2} = \cdots = \eta'_{k-(r-1),1}$.

Since $\eta_{k,r}$ is a nonincreasing function of r by Lemma 3.16 (c),

    $\eta_{k,r} \le \eta_{k,1} \le \eta'_{k-(r-1),1} = \eta'_{k,r}$ for all $k > r \ge 1$

follows.

So, when D is a uniform distribution, $\eta_{k,r} \le \eta'_{k,r}$ for all
$k > r \ge 0$.  □

Theorem 3.19: Under the assumption of an equally likely
distribution over the instance space, procedure Prune(r,k,Q,S)
with the tree labelling rule gives the least upperbound of the
pruning error $\eta_{k,r}$ in time $O(2^{k+r} + (en/k)^k)$.

Proof: Case 1 and 3: By reducing the rank of the subtree with
rank k to r, the rank of the whole tree will be r. Since
$\eta_{k,r}$ is a nonincreasing function of r for fixed k by Lemma
3.16 (c), it gives the least upperbound of the pruning error.

Case 2 and 4: There are only two alternative pruning
ways which may give the least upperbound error:

1) prune one subtree with rank k to r-1 and then prune the
other subtree to rank r; and

2) prune one subtree with rank k to r and then prune the other
subtree to r-1.

By Lemma 3.18, $\eta_{k,r} \le \eta'_{k,r}$. So, 1) gives the least
upperbound error.

Case 5: Since $\eta_r > 0$ for all r by Lemma 3.15, pruning one
subtree with rank k-1 to a tree with rank r and the other
subtree to rank r-1 gives the least upperbound of the pruning
error.

□ 

Lemma 3.20 (An upperbound of $\eta_{k,r}$):

Let $\mu_{n,r} = 0.5 - (0.5)^{n-r+1}$ for $n > r = 1$,

    $\mu_{n,r} = 0.5 - \{1 + (n-r)(0.5)^2\}(0.5)^{n-r+1}$ for $n > r > 1$.

Then $\eta_{k,r} \le \mu_{n,r} < 0.5$ for $0 < r < k \le n$.

Proof: In the proof of Lemma 3.17 we have shown that

    $\eta_{k,r} = (0.5)\,\eta_{k,r-1} + (0.5)\,\eta_{k-1,r}$.

Since $\eta_{k,r} \le \eta_{k,r-1}$ by Lemma 3.16 (c), $\eta_{k,r} \ge \eta_{k-1,r}$.
So, $\eta_{k,r}$ is a nondecreasing function of k. And since
$r(Q) = k \le n$ for all reduced binary decision trees Q over $V_n$,
$\eta_{k,r} \le \eta_{n,r}$. By Lemma 3.17, for r = 1,
$\eta_{n,r} = \mu_{n,r}$, and for r > 1,

    $\eta_{n,r} \le \mu_{n,r} = 0.5 - \{1 + (n-r)(0.5)^2\}(0.5)^{n-r+1}$ for $n > r > 1$.

For $n - r \ge 0$, the latter term is positive. So, $\mu_{n,r}$ is
strictly less than 0.5.

□ 
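
As a numerical sanity check of Lemma 3.20, the following sketch compares $\eta_{k,r}$, computed from the recursion of Lemma 3.17, with $\mu_{n,r}$ for small n. It is an illustration under stated assumptions: the boundary value $\eta_{k,r} = 0$ for $k \le r$ and the function names are not taken from the text.

```python
from fractions import Fraction
from functools import lru_cache

HALF = Fraction(1, 2)

@lru_cache(maxsize=None)
def eta(k, r):
    """Recursion of Lemma 3.17 with assumed boundary eta_{k,r} = 0 for k <= r."""
    if k <= r:
        return Fraction(0)
    if r == 0:
        return HALF
    return HALF * eta(k, r - 1) + HALF * eta(k - 1, r)

def mu(n, r):
    """Upper bound mu_{n,r} of Lemma 3.20."""
    gap = n - r
    if r == 1:
        return HALF - HALF ** (gap + 1)
    return HALF - (1 + gap * HALF ** 2) * HALF ** (gap + 1)

n = 6
for r in range(1, n):
    largest = max(eta(k, r) for k in range(r + 1, n + 1))
    print(r, float(largest), float(mu(n, r)))
```

For each r, the largest $\eta_{k,r}$ over $r < k \le n$ is printed next to $\mu_{n,r}$, so the gap between the exact value and the bound can be read off directly.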




3.5 Sample Size Sufficient for PAC Identification

We consider the size of a sample sufficient for PAC
identification with pruning in finite domains. Blumer et al.
(1987a) have given a lower bound of the sample size required
for PAC identification of any consistent learning algorithm.
They showed that for N given finite rules in the hypothesis
space, any rule agreeing with $m = (1/\epsilon) \ln(N/\delta)$ or more
randomly chosen examples has error greater than $\epsilon$ only with
probability less than $\delta$. This model assumes that there exists
a rule in the hypothesis space which correctly classifies all the
training examples.
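
To give a feel for the size of this bound, the sketch below evaluates $m = (1/\epsilon) \ln(N/\delta)$. The function name and the illustrative values ($N = 2^{64}$, $\epsilon = 0.1$, $\delta = 0.01$) are assumptions chosen for the example; $\ln N$ is passed directly so that very large hypothesis spaces do not overflow.

```python
import math

def consistent_sample_size(log_num_rules, eps, delta):
    """Smallest integer m with m >= (1/eps) * ln(N/delta), given ln N."""
    return math.ceil((log_num_rules + math.log(1.0 / delta)) / eps)

# Illustrative values only: N = 2^64 rules, eps = 0.1, delta = 0.01
print(consistent_sample_size(64 * math.log(2), eps=0.1, delta=0.01))
```

The bound grows only logarithmically in N and in $1/\delta$, and linearly in $1/\epsilon$.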

The number of possible rules, N, is used as a measure of
the complexity of the hypothesis space in PAC-learning. We
will use the following lemma when we quantify the hypothesis
space for binary decision trees.

Recall that $T_n^r$ denotes the set of all binary decision
trees over $V_n$ of rank at most r and $F_n^r$ denotes the set of
Boolean functions on $X_n$ that are represented by trees in $T_n^r$.
Then $|F_n^r|$ gives the number of rules in the hypothesis space
of decision trees of rank at most r.

Lemma 3.21 (An upperbound on $|F_n^r|$: Ehrenfeucht and Haussler,
1988):

    $|F_n^r| \le (8n)^{(en/r)^r}$, for $n \ge r > 0$.
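
Because the bound of Lemma 3.21 is astronomically large, it is easier to work with its logarithm, $\ln |F_n^r| \le (en/r)^r \ln(8n)$. The sketch below (the function name is hypothetical) evaluates this quantity for n = 6 and compares it with $\ln 2^{64}$, the logarithm of the number of all Boolean functions on six binary variables.

```python
import math

def log_rule_count_bound(n, r):
    """Natural log of the Lemma 3.21 bound: (e*n/r)^r * ln(8n)."""
    return (math.e * n / r) ** r * math.log(8 * n)

n = 6
for r in (1, 2, 3):
    print(r, round(log_rule_count_bound(n, r), 2))
print("ln 2^64 =", round(64 * math.log(2), 2))
```

For small n the bound already exceeds the trivial count of all Boolean functions, which is one source of looseness in the sample sizes derived from it.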




Consider now what happens if we prune a consistent
decision tree. Since pruning a consistent decision tree built
with informative variables introduces error over the training
samples, the pruned tree P is not consistent with all the
training samples. Angluin and Laird (1988) introduced into
PAC-learning a model of random errors, or "noise", in the
sampling. They assumed that it is possible to draw examples
from the relevant distribution D without error, but that the
process of determining and reporting whether the example is
positive or negative is subject to independent random mistakes
with some unknown probability $\eta < 0.5$. They called this
process "the classification noise process" and developed a
result which guarantees PAC-identification for the process.
Their result is as follows.

Lemma 3.22 (Angluin and Laird, 1988):

For any finite set of N rules, if we draw a sequence of

    $m \ge [\,2 / \{\epsilon^2 (1 - 2\eta_b)^2\}\,] \ln(2N/\delta)$

random examples for target concept f from an arbitrary
probability distribution D with classification noise $\eta$ (an
upperbound is $\eta_b < 1/2$) and find any hypothesis h that
minimizes the probability of misclassification in the
examples, then

    Prob{ d(h,f) $\ge \epsilon$ } $\le \delta$.
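
The effect of the noise bound $\eta_b$ on the sample size of Lemma 3.22 can be seen with a short calculation. The sketch below uses a hypothetical function name and illustrative values (N = 1000 rules, $\epsilon = 0.1$, $\delta = 0.05$) that do not come from the text.

```python
import math

def noisy_sample_size(log_num_rules, eps, delta, noise_bound):
    """Smallest integer m with m >= 2/(eps^2 (1-2*eta_b)^2) * ln(2N/delta)."""
    if not 0.0 <= noise_bound < 0.5:
        raise ValueError("the noise bound must satisfy 0 <= eta_b < 1/2")
    factor = 2.0 / (eps ** 2 * (1.0 - 2.0 * noise_bound) ** 2)
    return math.ceil(factor * (math.log(2.0 / delta) + log_num_rules))

# Illustrative values only
for eta_b in (0.0, 0.1, 0.25, 0.4):
    print(eta_b, noisy_sample_size(math.log(1000), 0.1, 0.05, eta_b))
```

The factor $1/(1 - 2\eta_b)^2$ makes the required sample grow without bound as $\eta_b$ approaches 1/2, which is why an upperbound on the pruning error strictly below 0.5 is needed in what follows.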




Our model does not assume classification noise. However,
the pruned tree is known to be in a hypothesis space where no
function h is exactly equivalent to the target function f.
Hence, the concepts we learn with pruning appear as if they
are subject to a classification error.

Since  we  defined  the  pruning  error  over  the  instance 
space,  the  pruning  error  can  be  considered  as  a classification 
noise  in  the  random  examples. 

A direct  application  of  Lemma  3.22  with  our  results  gives 
our  main  result. 

Theorem 3.23: For any $n \ge r > 0$, any target function
$f \in F_n^p$, $p \le n$, an equally likely probability distribution D
on $X_n$ and any $0 < \epsilon, \delta < 1$, given a sample S derived from
a sequence of

    $m \ge [\,2 / \{\epsilon^2 (1 - 2\mu_{n,r})^2\}\,]\,\{(en/r)^r \ln(8n) + \ln(2/\delta)\}$

random examples of f chosen independently according to D, with
probability $1 - \delta$, Findmin(S) and Prune(r,k,Q,S) using tree
labelling produce a hypothesis $h \in F_n^r$ that has error at most
$\epsilon$.

Proof: Since we defined the pruning error over the whole
instance space, the pruning error for the consistent decision
tree can be considered as a classification noise over the
random sample without violating the Angluin and Laird
framework. The number of rules N in a decision tree of rank
at most r is bounded above by Lemma 3.21. Findmin(S) produces
a consistent decision tree and Prune(r,k,Q,S) minimizes the
probability of misclassification over the whole instance space
with confidence probability 1. Lemma 3.22 holds for any
probability distribution D with $\eta_b < 0.5$ and for any
algorithm which minimizes the misclassification over the whole
instance space with confidence probability $1 - \delta$, which is less
than 1. Letting $\eta_b = \mu_{n,r}$ and noting that $\mu_{n,r}$ is less
than 0.5 for all $n \ge r > 0$ produces the desired result.

□ 

Combined with the result of Findmin(S) in Theorem 3.6 and
our analysis of Prune(r,k,Q,S) in Theorem 3.19, Theorem 3.23
shows that the decision tree with rank at most r on n
variables can be learned with pruning with accuracy $1 - \epsilon$ and
confidence $1 - \delta$ in time polynomial in $1/\epsilon$, $1/\delta$,
$1/(1 - 2\mu_{n,r})$ and n for fixed rank r, allowing one unit of time
to draw each random example. Thus pruning has increased the number
of examples we must obtain, but still retains its polynomial
sampling characteristic. If the rank of the induced decision
tree, k, can be estimated probabilistically, Theorem 3.23 can
be tightened by replacing $\mu_{n,r}$ by $\eta_{k,r}$.

The above sample sizes in Theorem 3.23 can be reduced by
using Laird's (1987) improved bound on learning from noisy
examples. Theorem 3.24 presents this bound.

Theorem 3.24: Assume $0 < \epsilon, \delta < 1/2$. For any $n \ge r > 0$, any
target function $f \in F_n^p$, $p \le n$, an equally likely probability
distribution D on $X_n$ and any $0 < \epsilon, \delta < 1/2$, given a sample S
derived from a sequence of

    $m \ge [\,1 / \{\epsilon\,(1 - \exp(-(0.5)(1 - 2\mu_{n,r})^2))\}\,]\,\{(en/r)^r \ln(8n) + \ln(1/\delta)\}$

random examples of f chosen independently according to D, with
probability $1 - \delta$, Findmin(S) and Prune(r,k,Q,S) using tree
labelling produce a hypothesis $h \in F_n^r$ that has error at most
$\epsilon$.


In the following we illustrate our findings by using the
loan default problem given in Messier and Hansen (1988). For
ease of presentation, we limit the instance space to 6 financial
ratios. We are to find rules predicting loan default of a
company using the following list of attributes. Here we
express all the variables as binary variables with values low
and high. These splits come from Messier and Hansen (1988).


    Attributes                       low        high

    Current Ratio                    < 1.912    > 1.912
    Long-term Debt/Net Worth         < .486     > .486
    Lo Long-term Debt/Net Worth      < .046     > .046 and < .486
    Working Capital/Sales            < .222     > .222
    Net Income/Total Assets          < .100     > .100
    Net Income/Sales                 < .010     > .010

Suppose the hypothesis space H is decision trees. Then, the
number of rules in H is $|H| = 2^{64} = 1.8447 \times 10^{19} = N$.


Here we illustrate how the decision tree construction and
pruning algorithms work. First we construct a consistent
decision tree with Findmin(S), where S is in Table 3.4. (This
is the same sequence of examples used in Messier and Hansen
1988.)

Let 0 = low, and 1 = high. Algorithm Findmin(S) begins
with Find(S,0), and returns "none" since the examples in S are
not in the same class.

Now the algorithm calls Find(S,1). "Current Ratio" is an
informative variable. In Step 3.a it calls two subprocedures,
Find($S_0^v$,0) and Find($S_1^v$,0). Find($S_0^v$,0) is successful but
Find($S_1^v$,0) is not successful. In Step 3.c it calls
Find($S_1^v$,1). We see that the examples in each branch
of each variable (attribute) are not in the same class. Hence
Find($S_1^v$,1) returns "none". By a similar procedure we see
that Find(S,1) returns "none", since for any informative
variable other than "Current Ratio", $S_0^v$ is not in the same
class and $S_1^v$ is not in the same class, respectively.

Now Findmin(S) calls Find(S,2). "Current Ratio" is an
informative variable. In Step 3.a it calls two subprocedures,
Find($S_0^v$,1) and Find($S_1^v$,1). Find($S_0^v$,1) is successful but
Find($S_1^v$,1) is not successful. In Step 3.c it calls
Find($S_1^v$,2). Now let S = $S_1^v$. "Long-term Debt/Net Worth" is
an informative variable in this subsample. In Step 3.a it
calls two subprocedures, Find($S_0^v$,1) and Find($S_1^v$,1). We see
that both subprocedures are successful. Hence Find(S,2)
returns a consistent decision tree of rank two in Figure 3.1.
Thus Findmin(S) returns a minimal rank decision tree of rank
two consistent with sample S.


Table 3.4: Sample S for Loan Default problem

                   Values of Attributes*
    (1)      (2)       (3)      (4)      (5)       (6)       Class**

    2.767    0.286     0.286    0.383    0.045     0.037     N
    1.540    0.985     n/a      1.171    0.017     0.031     D
    1.680    0.203     0.203    0.169    0.023     0.014     D
    1.548    0.334     0.334    0.211    0.034     0.023     D
    3.460    0.415     0.415    0.393    0.052     0.037     N
    2.539    0.656     n/a      0.440    0.045     0.524     D
    3.085    0.102     0.102    0.251    0.136     0.066     N
    1.762    0.375     0.375    0.175    0.107     0.069     D
    5.094    0.486     n/a      0.469    0.132     0.153     D
    2.521    0.239     0.239    0.223    0.085     0.048     N
    1.770    0.088     0.088    0.223    0.033     0.031     D
    2.826    0.485     0.485    0.288    0.038     0.033     N
    1.269    1.133     n/a      0.072    0.052     0.130     D
    4.042    0.472     0.472    0.697    0.049     0.048     N
    2.246    0.972     n/a      0.236    0.036     0.035     D
    1.879    0.879     n/a      0.178    0.043     0.028     D
    2.582    0         0        0.445    0.143     0.154     N
    2.338    0.043     0.043    0.378    0.055     0.054     D
    2.159    0.239     0.239    0.113    0.019     0.005     N
    2.005    0.748     n/a      0.170    0.089     0.530     N
    2.156    1.089     n/a      0.204    0.033     0.021     N
    1.940    1.237     n/a      0.266    0.018     0.013     D
    1.977    0.161     0.161    0.211    0.066     0.062     N
    1.920    5.474     n/a      0.149    0.015     0.011     N
    2.330    0.049     0.049    0.276    0.052     0.041     N
    1.930    0.621     n/a      0.145    0.024     0.009     D
    1.904    0.042     0.042    0.158    0.076     0.054     D
    1.788    0.401     0.401    0.202    -0.063    -0.043    D
    2.419    0.074     0.074    0.161    0.127     0.062     N
    2.080    0.932     n/a      0.208    0.054     0.059     N
    1.341    12.780    n/a      0.091    -0.046    -0.051    D
    2.616    0.185     0.185    0.235    0.090     0.115     N

*: Attributes in each column have the following names.
    (1) Current Ratio                (2) Long-term Debt/Net Worth
    (3) Lo Long-term Debt/Net Worth  (4) Working Capital/Sales
    (5) Net Income/Total Assets      (6) Net Income/Sales
**: (N) Not Default, (D) Default
***: (n/a) not applicable


    Current Ratio
      low:  Default (10)*
      high: Long-term Debt/Net Worth
        low:  Lo Long-term Debt/Net Worth
          low:  Net Income/Total Assets
            low:  Default (1)
            high: No Default (1)
          high: No Default (11)
        high: Working Capital/Sales
          low:  Net Income/Sales
            low:  Default (1)
            high: No Default (4)
          high: Default (4)

*: The number of examples in each terminal node.

Figure 3.1: A decision tree predicting loan default.
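
The consistency of the tree in Figure 3.1 with sample S can be verified mechanically. The sketch below is an illustration, not the Findmin procedure itself: it transcribes the rows of Table 3.4 and the splits listed above, and it assumes that a value equal to a threshold counts as "high" (this matches the n/a entry in column (3) for the example with Long-term Debt/Net Worth = 0.486).

```python
# Rows of Table 3.4: columns (1)-(6) and the class; None stands for n/a.
ROWS = [
    (2.767, 0.286, 0.286, 0.383, 0.045, 0.037, "N"),
    (1.540, 0.985, None, 1.171, 0.017, 0.031, "D"),
    (1.680, 0.203, 0.203, 0.169, 0.023, 0.014, "D"),
    (1.548, 0.334, 0.334, 0.211, 0.034, 0.023, "D"),
    (3.460, 0.415, 0.415, 0.393, 0.052, 0.037, "N"),
    (2.539, 0.656, None, 0.440, 0.045, 0.524, "D"),
    (3.085, 0.102, 0.102, 0.251, 0.136, 0.066, "N"),
    (1.762, 0.375, 0.375, 0.175, 0.107, 0.069, "D"),
    (5.094, 0.486, None, 0.469, 0.132, 0.153, "D"),
    (2.521, 0.239, 0.239, 0.223, 0.085, 0.048, "N"),
    (1.770, 0.088, 0.088, 0.223, 0.033, 0.031, "D"),
    (2.826, 0.485, 0.485, 0.288, 0.038, 0.033, "N"),
    (1.269, 1.133, None, 0.072, 0.052, 0.130, "D"),
    (4.042, 0.472, 0.472, 0.697, 0.049, 0.048, "N"),
    (2.246, 0.972, None, 0.236, 0.036, 0.035, "D"),
    (1.879, 0.879, None, 0.178, 0.043, 0.028, "D"),
    (2.582, 0.0, 0.0, 0.445, 0.143, 0.154, "N"),
    (2.338, 0.043, 0.043, 0.378, 0.055, 0.054, "D"),
    (2.159, 0.239, 0.239, 0.113, 0.019, 0.005, "N"),
    (2.005, 0.748, None, 0.170, 0.089, 0.530, "N"),
    (2.156, 1.089, None, 0.204, 0.033, 0.021, "N"),
    (1.940, 1.237, None, 0.266, 0.018, 0.013, "D"),
    (1.977, 0.161, 0.161, 0.211, 0.066, 0.062, "N"),
    (1.920, 5.474, None, 0.149, 0.015, 0.011, "N"),
    (2.330, 0.049, 0.049, 0.276, 0.052, 0.041, "N"),
    (1.930, 0.621, None, 0.145, 0.024, 0.009, "D"),
    (1.904, 0.042, 0.042, 0.158, 0.076, 0.054, "D"),
    (1.788, 0.401, 0.401, 0.202, -0.063, -0.043, "D"),
    (2.419, 0.074, 0.074, 0.161, 0.127, 0.062, "N"),
    (2.080, 0.932, None, 0.208, 0.054, 0.059, "N"),
    (1.341, 12.780, None, 0.091, -0.046, -0.051, "D"),
    (2.616, 0.185, 0.185, 0.235, 0.090, 0.115, "N"),
]

def classify(row):
    """Follow the tree of Figure 3.1; 'low' means strictly below the split."""
    current, ltd, lo_ltd, wc_sales, ni_assets, ni_sales, _ = row
    if current < 1.912:
        return "D"
    if ltd < 0.486:                       # low Long-term Debt/Net Worth
        if lo_ltd < 0.046:
            return "D" if ni_assets < 0.100 else "N"
        return "N"
    if wc_sales < 0.222:                  # high Long-term Debt/Net Worth
        return "D" if ni_sales < 0.010 else "N"
    return "D"

mismatches = [row for row in ROWS if classify(row) != row[6]]
print(len(ROWS), "examples,", len(mismatches), "misclassified")
```

Under these assumptions the tree classifies all 32 examples of Table 3.4 correctly, in line with Findmin(S) returning a consistent tree.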


Suppose we want a decision tree of rank one. We prune
the above decision tree in Figure 3.1 with Prune(r,k,Q,S).
Here r = 1, k = 2, Q = the decision tree in Figure 3.1, and S
is the sample used in Messier and Hansen (1988). Since Q is
in Case 3, Prune(1,2,Q,S) calls one subprocedure,
Prune(1,2,$Q_1$,$S_1$), with Q = $Q_1$ and S = $S_1$. Here, the procedure
calls two subprocedures, Prune(0,1,$Q_0$,$S_0$) and
Prune(1,1,$Q_1$,$S_1$), since the new Q is in Case 5. By Step 2,
Prune(0,1,$Q_0$,$S_0$) returns "Q = No Default" by Tree Labelling.
By Step 1, Prune(1,1,$Q_1$,$S_1$) returns $Q_1$. Hence Prune(1,2,Q,S)
returns a pruned decision tree of rank one in Figure 3.2.


    Current Ratio
      low:  Default (10)*
      high: Long-term Debt/Net Worth
        low:  No Default (12:1)**
        high: Working Capital/Sales
          low:  Net Income/Sales
            low:  Default (1)
            high: No Default (4)
          high: Default (4)

*: The number of examples in each terminal node.

**: 12 examples are in No Default, one example is in Default.

Figure 3.2: A pruned decision tree predicting loan default.
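
The ranks of the two trees can be confirmed with a few lines of code. The sketch below assumes the usual rank definition of Ehrenfeucht and Haussler (a leaf has rank 0; an internal node has rank one more than its subtrees if their ranks are equal, and the larger of the two ranks otherwise); the nested-tuple encoding of the trees is a hypothetical representation chosen for the example.

```python
def rank(tree):
    """Rank of a binary decision tree.

    A tree is either a leaf label (str) or a tuple
    (variable, low_subtree, high_subtree).
    """
    if isinstance(tree, str):
        return 0
    _, low, high = tree
    r_low, r_high = rank(low), rank(high)
    return r_low + 1 if r_low == r_high else max(r_low, r_high)

HIGH_DEBT_SUBTREE = (
    "Working Capital/Sales",
    ("Net Income/Sales", "Default", "No Default"),
    "Default",
)

FIGURE_3_1 = (
    "Current Ratio",
    "Default",
    ("Long-term Debt/Net Worth",
     ("Lo Long-term Debt/Net Worth",
      ("Net Income/Total Assets", "Default", "No Default"),
      "No Default"),
     HIGH_DEBT_SUBTREE),
)

# Figure 3.2: the low branch of Long-term Debt/Net Worth is pruned to a leaf.
FIGURE_3_2 = (
    "Current Ratio",
    "Default",
    ("Long-term Debt/Net Worth", "No Default", HIGH_DEBT_SUBTREE),
)

print(rank(FIGURE_3_1), rank(FIGURE_3_2))
```

Replacing the rank-one subtree under the low branch by a single leaf is what lowers the rank of the whole tree from two to one.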


Now we give sample sizes sufficient for learning a
decision tree with pruning. We use the above decision tree of
rank one pruned from an induced decision tree of rank two.
Here n = 6, k = 2, and r = 1. Then, by Theorem 3.23, the
number of examples, m, sufficient for learning a decision tree
of rank one is

    for $\epsilon$ = 0.5 and $\delta$ = 0.01, m = 560,609, and
    for $\epsilon$ = 0.1 and $\delta$ = 0.01, m = 14,015,240.
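
The calculation behind these figures can be sketched as follows, using $\mu_{n,r}$ from Lemma 3.20 and the bound of Lemma 3.21 inside the formula of Theorem 3.23. The function names are hypothetical, and the computed values agree with the quoted figures only up to rounding in the intermediate constants.

```python
import math

def mu(n, r):
    """Upper bound mu_{n,r} on the pruning error (Lemma 3.20)."""
    gap = n - r
    if r == 1:
        return 0.5 - 0.5 ** (gap + 1)
    return 0.5 - (1 + gap * 0.25) * 0.5 ** (gap + 1)

def sample_size_3_23(n, r, eps, delta):
    """Right-hand side of the sample-size condition in Theorem 3.23."""
    noise = mu(n, r)
    log_rules = (math.e * n / r) ** r * math.log(8 * n)   # ln of Lemma 3.21 bound
    factor = 2.0 / (eps ** 2 * (1.0 - 2.0 * noise) ** 2)
    return factor * (log_rules + math.log(2.0 / delta))

for eps in (0.5, 0.1):
    print(eps, round(sample_size_3_23(6, 1, eps, 0.01)))
```

With n = 6 and r = 1, $1 - 2\mu_{6,1} = 1/32$, so the factor $1/(1 - 2\mu_{n,r})^2$ alone contributes a multiplier of 1024.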




The above sample sizes are very large because of the
loose bounds $\mu_{n,r}$ and $|F_n^r|$. Since k = 2, $\mu_{6,1}$ can be
substituted by the least upper bound of the pruning error,
$\eta_{2,1} = 0.25$. In fact, $|F_n^r| \le N = 2^{64}$ for any value of r.
With these tighter bounds the number of examples, m, sufficient
for learning reduces to

    for $\epsilon$ = 0.5 and $\delta$ = 0.01, m = 6,356, and
    for $\epsilon$ = 0.1 and $\delta$ = 0.01, m = 158,895.

By using Theorem 3.24 and tighter bounds for $\eta_{k,r}$ and
$|F_n^r|$, the sufficient sample size could be reduced as follows:

    for $\epsilon$ = 0.499 and $\delta$ = 0.01, m = 833,
    for $\epsilon$ = 0.2 and $\delta$ = 0.01, m = 2,084, and
    for $\epsilon$ = 0.1 and $\delta$ = 0.01, m = 4,168

are obtained as the number of examples sufficient for learning.

Hence, the learned concept (that is, a pruned decision
tree of rank at most one) h has error less than 10% with
probability greater than 99% after 4,168 random independent
examples.
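
The following sketch evaluates the bound of Theorem 3.24 with the tighter quantities $\eta_{2,1} = 0.25$ and $N = 2^{64}$ in place of $\mu_{n,r}$ and the bound of Lemma 3.21. The function name is hypothetical; rounding up to the next integer is an assumption about how the figures were obtained.

```python
import math

def sample_size_3_24(log_num_rules, noise, eps, delta):
    """Smallest integer m satisfying the condition of Theorem 3.24,
    with ln N and the noise bound supplied directly."""
    rate = 1.0 - math.exp(-0.5 * (1.0 - 2.0 * noise) ** 2)
    return math.ceil((log_num_rules + math.log(1.0 / delta)) / (eps * rate))

LOG_N = 64 * math.log(2)        # ln of N = 2^64
for eps in (0.2, 0.1):
    print(eps, sample_size_3_24(LOG_N, noise=0.25, eps=eps, delta=0.01))
```

Under these assumptions the sketch gives 2,084 and 4,168 examples for $\epsilon$ = 0.2 and $\epsilon$ = 0.1. Unlike the bound of Theorem 3.23, this one depends on $1/\epsilon$ rather than $1/\epsilon^2$.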


3.6  Other  Pruning  Rules 

In Section 3.4, we determined the least upperbound of the
pruning error with the assumption of an equally likely
distribution and the tree labelling rule. Here we give
additional insight into the pruning error by considering some
problematic cases.

In Section 3.4, we saw that sample labelling under a
nonuniform distribution could give a pruning error greater
than 0.5. Now consider the sample labelling rule under the
assumption of an equally likely distribution. Table 3.5
presents an example which shows that the least upperbound of the
pruning error may be greater than 0.5. We use the same
notation as in Tables 3.2 and 3.3 of Section 3.4.

Table 3.5: Sample labelling with a uniform distribution

    x            f(x)    D(x)    n(x)    f_Q(x)    f_P(Q)(x)

    (1,1,1,1)    1       1/16    1       1         1
    (1,1,1,0)    0       1/16    1       0         0
    (1,1,0,1)    1       1/16    1       1         0
    (1,1,0,0)    0       1/16    2       0         0
    (1,0,1,1)    1       1/16    1       1         0
    (1,0,1,0)    1       1/16    0       1         0
    (1,0,0,1)    1       1/16    1       1         0
    (1,0,0,0)    0       1/16    3       0         0
    (0,1,1,1)    1       1/16    2       1         0
    (0,1,1,0)    1       1/16    0       1         0
    (0,1,0,1)    1       1/16    0       1         0
    (0,1,0,0)    1       1/16    0       1         0
    (0,0,1,1)    1       1/16    1       1         0
    (0,0,1,0)    1       1/16    0       1         0
    (0,0,0,1)    1       1/16    1       1         0
    (0,0,0,0)    0       1/16    6       0         0


In this decision tree, r(Q) = 2 and r(P(Q)) = 1. The
pruning error in this case is

    Prob{ x : $f(x) \ne f_{P(Q)}(x)$ } - Prob{ x : $f(x) \ne f_Q(x)$ }

    = 7/16 + 3/16 + 1/16 = 11/16 > 0.5.
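
The value 11/16 can be reproduced directly from Table 3.5. The sketch below makes two assumptions that are inferred from the table rather than stated in it: n(x) is the number of sample examples with attribute vector x, and the pruned nodes are those reached by $x_1 = 0$, by $(x_1, x_2) = (1, 0)$ and by $(x_1, x_2, x_3) = (1, 1, 0)$. Each pruned node receives the label of the majority of the sample examples it covers.

```python
from fractions import Fraction
from itertools import product

X = list(product([1, 0], repeat=4))     # (1,1,1,1), (1,1,1,0), ..., (0,0,0,0)
F = [1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0]   # f(x); here f_Q = f
N = [1, 1, 1, 2, 1, 0, 1, 3, 2, 0, 0, 0, 1, 0, 1, 6]   # n(x), assumed sample counts
PRUNED = [(0,), (1, 0), (1, 1, 0)]      # assumed prefixes of the pruned nodes

def sample_label(prefix):
    """Majority class among the sample examples covered by a pruned node."""
    covered = [(f, n) for x, f, n in zip(X, F, N) if x[:len(prefix)] == prefix]
    positives = sum(n for f, n in covered if f == 1)
    negatives = sum(n for f, n in covered if f == 0)
    return 1 if positives > negatives else 0

def f_pruned(x, fx):
    """Output of P(Q): the leaf label below a pruned node, f_Q(x) elsewhere."""
    for prefix in PRUNED:
        if x[:len(prefix)] == prefix:
            return sample_label(prefix)
    return fx

error_pruned = Fraction(sum(f_pruned(x, f) != f for x, f in zip(X, F)), len(X))
error_q = Fraction(0)                   # Q agrees with f on every instance
pruning_error = error_pruned - error_q
print(pruning_error)
```

All three pruned nodes receive label 0 because the negative examples dominate the sample there, even though most of the instances they cover are positive.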

We now turn to labelling rules that are not
deterministic. In such methods we may use some probabilistic
labelling procedure where the pruned node is labelled 0 or 1
with probability in proportion to the number of corresponding
sample examples or instances with f(x) = 0 and f(x) = 1. As we
will see, the pruning error may exceed 0.5 for both sample
labelling and tree labelling. Below we define two
nondeterministic labelling methods.

Definition 3.25 (Nondeterministic Labelling): Let i be an
internal node in a decision tree Q, and Q(i) be a subtree of
Q such that node i is the root of the tree Q(i).

1. Sample labelling:

Let s(i) be a subset of the sample S used to construct Q
from its root to node i.

Let $s_0(i)$ denote the number of negative examples in s(i)
and $s_1(i)$ denote the number of positive examples in s(i). If
we prune Q at i, then label i as

    0 with probability $s_0(i) / (s_0(i) + s_1(i))$ and
    1 with probability $s_1(i) / (s_0(i) + s_1(i))$.


2.  Tree  labelling: 


90 


Let  Vii,‘*’,Vip  be  the  nodes  in  the  path  from  the  root  to 

the  parent  of  node  i in  Q.  And  let  a.  a,  be  the  labels 

1 P 

of  each  node,  where  i1/...,ip  are  p distinct  indices  of 

{ 1 , . . . , n } . 

Let  l0(i)  = |{x:  f0  (x)  =0 } | , where  x e Xn  such  that 

ain  / • • • 1 ai  are  the  same . 

1 P 

Let  li(i)  = | { x : f0 (x)  =1 } J , where  x e Xn  such  that 

ail,-..,aip  are  the  same.  If  we  prune  Q at  i , then  label  i as 

I 0 with  probability  l0(i)  / (l0(i)  + ll(i))  and 
I 1 with  probability  l2(i)  / (l0(i)  + l: (i) ) . 
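Both rules in Definition 3.25 have the same form: a pair of counts is turned into label probabilities and a label is drawn. A minimal Python sketch of that step follows. The function names are hypothetical, and the counts are assumed to be supplied by the caller (s_0(i), s_1(i) for sample labelling, or l_0(i), l_1(i) for tree labelling).

```python
import random


def label_probabilities(count0, count1):
    """Return (Prob{label 0}, Prob{label 1}) in proportion to the two counts."""
    total = count0 + count1
    if total == 0:
        raise ValueError("at least one example or instance is required")
    return count0 / total, count1 / total


def nondeterministic_label(count0, count1, rng=random):
    """Draw the label of a pruned node: 0 with probability count0 / (count0 + count1)."""
    p0, _ = label_probabilities(count0, count1)
    return 0 if rng.random() < p0 else 1


# Example: a node covering 1 negative and 3 positive examples.
print(label_probabilities(1, 3))
```

Passing a seeded random.Random instance as rng makes a run reproducible.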

By using the case shown in Table 3.5, we show that the
pruning error may exceed 0.5 with both nondeterministic
labelling methods. If we use nondeterministic sample
labelling, then the pruned node may receive a 0 or 1 label
with the corresponding probabilities defined in
Definition 3.25. Since $s_0(\cdot)$ is positive for all pruned nodes
of P(Q) in the case shown in Table 3.5, all pruned nodes may
have 0 labels. In that case the pruning error is

$$0.5(0.875 - 0) + 0.25(0.75 - 0) + 0.125(0.5 - 0) = 0.6875 > 0.5.$$

Similarly, since $l_0(\cdot)$ is positive for all pruned nodes of
P(Q), all pruned nodes may have 0 labels. In that case the
pruning error for tree labelling is equal to that for sample
labelling calculated above.


So, even for an equally likely distribution D, the least
upper bound of the pruning error may exceed 0.5 under
nondeterministic labelling.

However, if we use the nondeterministic tree labelling,
the upper bound of the average pruning error is less than
or equal to 0.5 under a uniform distribution. The following
lemma shows this.

Lemma 3.26: When D is a uniform distribution and Prune() uses
the nondeterministic tree labelling rule, the upper bound of
the average pruning error, $\mu_{k,r}$, is less than or equal to 0.5
for all $k > r > 0$.

Proof: Let $Q \in W_n^k$. Suppose some internal node "i" in Q is at
level m ($m \leq n$). Further suppose that we prune Q to P(Q) at
node "i" leaving "i" a leaf node in P(Q). Let p("i") be the
probability of instances covered by the leaves of the subtree
Q(i) in Q. Under a uniform distribution,
$p(\text{"i"}) = 2^{n-m}/2^n = (0.5)^m$. Let e("i") be the
probability of incorrect classifications of the instances
covered by the leaves of subtree Q(i) over the whole instance
space.

By the definition of $l_0(i)$ and $l_1(i)$ in Definition 3.8,
$l_0(i) + l_1(i) = 2^{n-m}$. Suppose $a_0(i)$ is the number of
incorrect classifications of $l_0(i)$ and $a_1(i)$ is the number of
incorrect classifications of $l_1(i)$. Then clearly

$$e(\text{"i"}) = (0.5)^m \, (a_0(i) + a_1(i)) / (l_0(i) + l_1(i))$$

where $0 \leq a_0(i) \leq l_0(i)$ and $0 \leq a_1(i) \leq l_1(i)$.

If we prune Q at node "i", then in P(Q) the label of the
leaf "i" is

    0 with probability $l_0(i) / (l_0(i) + l_1(i))$ and
    1 with probability $l_1(i) / (l_0(i) + l_1(i))$.

Below we suppress (i) for clearer reading. If the label of
"i" is 0, then

$$e(\text{"i"}) = (0.5)^m \left( 1 - (l_0 - a_0 + a_1)/(l_0 + l_1) \right)$$

since the true probability of a 0 label is
$(l_0 - a_0 + a_1)/(l_0 + l_1)$.

By the definition of the pruning error,

$$e(Q,P(Q)) = e(P(Q)) - e(Q)$$
$$= (0.5)^m \left[ ((l_1 + a_0 - a_1) - (a_0 + a_1)) / (l_0 + l_1) \right]$$
$$= (0.5)^m (l_1 - 2a_1) / (l_0 + l_1).$$

Similarly, if the label of node "i" is 1, then

$$e(Q,P(Q)) = (0.5)^m (l_0 - 2a_0) / (l_0 + l_1).$$

So, the average pruning error is

$$(l_0/(l_0 + l_1)) \cdot (0.5)^m (l_1 - 2a_1)/(l_0 + l_1) + (l_1/(l_0 + l_1)) \cdot (0.5)^m (l_0 - 2a_0)/(l_0 + l_1)$$
$$= (0.5)^m (2 l_0 l_1 - 2 l_0 a_1 - 2 l_1 a_0) / (l_0 + l_1)^2$$
$$\leq (0.5)^m (2 l_0 l_1) / (l_0 + l_1)^2 \leq (1/2)(0.5)^m$$

since $4 l_0 l_1 \leq (l_0 + l_1)^2$ and $a_0, a_1 \geq 0$.

Since the above bound holds for all pruned nodes, the
upper bound of the average pruning error, $\mu_{k,r}$, is less than
or equal to 0.5.

□ 
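The final inequality in the proof can be checked by brute force for a small instance space. The sketch below is an illustration only: it enumerates every admissible combination of l_0, l_1, a_0, a_1 for a node at level m over n Boolean attributes and confirms that the average pruning error never exceeds (1/2)(0.5)^m. The function names are ours.

```python
from fractions import Fraction


def average_pruning_error(l0, l1, a0, a1, m):
    """Average pruning error at a node at level m under nondeterministic tree labelling."""
    total = l0 + l1
    scale = Fraction(1, 2 ** m)                      # (0.5)^m
    if_label_0 = scale * Fraction(l1 - 2 * a1, total)
    if_label_1 = scale * Fraction(l0 - 2 * a0, total)
    return Fraction(l0, total) * if_label_0 + Fraction(l1, total) * if_label_1


def check_bound(n, m):
    """True if the bound (1/2)(0.5)^m holds for every split of the 2^(n-m) covered instances."""
    size = 2 ** (n - m)
    bound = Fraction(1, 2) * Fraction(1, 2 ** m)
    for l0 in range(size + 1):
        l1 = size - l0
        for a0 in range(l0 + 1):
            for a1 in range(l1 + 1):
                if average_pruning_error(l0, l1, a0, a1, m) > bound:
                    return False
    return True


print(check_bound(4, 1))
```

The bound is attained when l_0 = l_1 and a_0 = a_1 = 0, which is the case in which the inequality 4 l_0 l_1 <= (l_0 + l_1)^2 holds with equality.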




So,  the  nondeterministic  tree  labelling  rule  can  hedge 
the  risks  of  biased  nonrandom  selections  of  examples  by 
guaranteeing  that  the  average  pruning  error  does  not  exceed 
0.5  regardless  of  the  sampling  distribution. 

3.7  Chapter  Summary 


Empirical  results  have  shown  that  pruning  can  improve  the 
accuracy  of  an  induced  decision  tree.  It  also  leads  to  more 
concise  rules.  Here  we  provide  a pruning  algorithm  based  on 
the  rank  of  a decision  tree.  A bound  on  the  error  due  to 
pruning  by  the  rank  of  a decision  tree  is  determined  under  the 
assumptions  of  an  equally  likely  distribution  of  the  instance 
space  and  a deterministic  tree  labelling  rule.  This  bound  is 
then  used  with  recent  results  in  learning  theory  to  determine 
a sample  size  sufficient  for  PAC  identification  of  decision 
trees  with  pruning.  We  also  discuss  other  pruning  rules  and 
their effects on the error due to pruning. With the
nondeterministic tree labelling rule we show that the
upper bound of the average pruning error is less than or equal
to 0.5 under an equally likely distribution.

Future  work  will  be  needed  to  determine  the  pruning  error 
under  more  general  assumptions  on  the  distribution  over  the 
instance space. Also, Theorem 3.23 can be tightened if an a
priori estimate, k, of the rank of the induced decision tree
can be determined. If so, $\mu_{n,r}$ can be replaced by $\mu_{k,r}$.




Here  we  take  the  rank  of  a tree  as  a conciseness  measure 
of  a decision  tree.  Future  work  will  be  needed  to  assess  the 
effect  of  pruning  under  other  conciseness  criteria. 


CHAPTER  4 

THE ACCURACY OF A PRUNED DECISION TREE

4.1 Introduction

In  the  previous  chapter  we  provided  a bound  on  the 
training  sample  size  to  guarantee  PAC  learning  of  decision 
trees with pruning. In many learning environments it is often
not possible to obtain a sample large enough to guarantee PAC
learning. In such cases it is of value to obtain a posterior
evaluation of the accuracy of a pruned decision tree. For
this  purpose,  a set  of  examples,  called  the  test  set,  is 
employed.  Here  it  is  important  to  ensure  that  the  test  set 
can  be  considered  as  independent  of  the  sample  used  in 
constructing  the  decision  tree,  and  drawn  from  the  same 
distribution.  So,  a test  set  is  often  a holdout  sample 
randomly  removed  from  original  cases  available  for  training. 
Often  the  holdout  sample  is  1/3  or  1/2  of  the  original  sample 
available  for  training  though  there  is  no  theoretical 
justification  for  that  ratio. 

Tsai and Koehler (1991) give an excellent result on the
posterior measurement for the error and confidence parameters
$(\epsilon, \delta)$. They assume that there is a rule which is
consistent with all training samples. The use of their result
for pruned trees is, however, inappropriate since pruning a
decision tree always leads to an inconsistent decision rule.
In this chapter we present some methods for error and
confidence parameter estimation for a pruned decision tree.


4.2 The Estimation of Error of a Pruned Decision Tree

In this section we present three methods for error and
confidence parameter estimation for the pruned decision tree.
First recall the basic definitions given in Chapters 2 and 3.

Let X be the instance space of interest. The target
concept f maps X into {0,1}. Similarly, for any other concept
h, we have

$$h : X \to \{0,1\}.$$

The error, d(h,f), of a learned concept h is the probability
of the instances incorrectly classified by h. That is,

$$d(h,f) = \mathrm{Prob}\{x \in X : h(x) \neq f(x)\}.$$

Prob{} is determined by an arbitrary sampling distribution, D,
over X. Sampling is assumed to be with replacement with
samples drawn independently.

Let $\theta = d(h,f)$, where h is a pruned tree P(Q), and f is
the target concept (the correct decision tree). The following
result gives a posterior estimate of the accuracy of a pruned
tree P(Q), determined using the independent test sample of
size m.


4.2.1 When No Prior Information Exists

We  may  assume  a uniform  distribution  of  error  when  there 
is  no  reliable  information  for  the  distribution  of  error.  The 
following  result  gives  a posterior  estimate  of  the  accuracy  of 
a pruned  decision  tree  determined  using  a test  sample  assuming 
a uniform  prior  (Raiffa  and  Schlaifer,  1961;  Winkler  1972). 

Theorem 4.0: (Posterior Estimate of the Confidence Parameter)
Given b misclassifications (i.e. failures) in a test
sample of size m, the posterior estimate of the binomial
parameter, $\theta$, using a uniform prior is

$$\mathrm{Prob}\{\theta \geq \epsilon \mid b, m\} = \sum_{k=0}^{b} C_k^{m+1} \epsilon^k (1-\epsilon)^{m+1-k} \quad \text{for } \epsilon \leq 0.5$$

and

$$\mathrm{Prob}\{\theta \geq \epsilon \mid b, m\} = \sum_{k=m+1-b}^{m+1} C_k^{m+1} (1-\epsilon)^k \epsilon^{m+1-k} \quad \text{for } \epsilon > 0.5.$$

Here $C_k^m$ is the number of ways of unordered sampling of size
k out of m.

The  terms  in  Theorem  4.0  are  easily  computed  using  the 
Incomplete  Beta  distribution  and  methods  given  in  Abramowitz 
and  Segun  (1968)  or  approximated  using  methods  given  by  Peizer 
and  Pratt  (1968).  As  indicated  by  Tsai  and  Koehler  (1991),  a 
uniform  prior  is  often  inappropriate  since  the  error  often  has 
high  probability  in  a certain  range. 
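For small test samples the sum in Theorem 4.0 can also be evaluated directly. The Python sketch below is an illustration under that assumption and uses only the first form of the sum, since the second form is the same sum re-indexed by k -> m+1-k. The function name is ours.

```python
from math import comb


def posterior_tail_uniform(eps, b, m):
    """Prob{theta >= eps | b, m} under a uniform prior, as the binomial sum of Theorem 4.0."""
    n = m + 1
    return sum(comb(n, k) * eps ** k * (1 - eps) ** (n - k) for k in range(b + 1))


# Example: b = 2 misclassifications in a test sample of size m = 16.
print(posterior_tail_uniform(0.2, 2, 16))
print(posterior_tail_uniform(0.3, 2, 16))
```

For large m the direct sum loses accuracy, and an Incomplete Beta routine or the approximations cited above would be the more suitable choice.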




In the next subsection we develop probabilistic estimates
of the error that are based on the assumption of a Beta prior
distribution.


4.2.2 Using the Information of the Training Sample

We  may  assume  a Beta  prior  to  obtain  a parameter  set 
consistent  with  the  training  information.  Then,  using  the 
test  sample,  we  can  determine  a posterior  estimate  for  the 
error.  To  derive  various  estimates  we  will  need  Hoeffding's 
inequality. 

Lemma 4.1: (Hoeffding Inequalities; Hoeffding, 1963)

Let $x_1, x_2, \ldots, x_n$ be independent random variables with
$0 \leq x_i \leq 1$ and $E[x_i] = \mu$, for $i = 1, 2, \ldots, n$, and let

$$\bar{x} = \Big(\sum_{i=1}^{n} x_i\Big) / n.$$

Then,

$$\mathrm{Prob}\{\bar{x} - \mu \geq c\} \leq \exp(-2nc^2)$$

and

$$\mathrm{Prob}\{\mu - \bar{x} \geq c\} \leq \exp(-2nc^2).$$


Suppose that there are b misclassifications out of the
training sample of size m. By Hoeffding's inequality, the
following bound holds.




Lemma 4.2:

For any $b/m < \epsilon$,

$$\mathrm{Prob}\{\theta \geq \epsilon\} \leq \exp(-2(\epsilon - b/m)^2 m).$$

Proof: Let $x_i = 1$ if a pruned decision tree misclassifies the
$i$-th example and $x_i = 0$ otherwise. The probability of
misclassification $\theta$ is the expected value $E[x_i]$. Since we
have b misclassifications in the independent sample,
$\bar{x} = b/m$.

$\mathrm{Prob}\{\theta \geq \epsilon\} = \mathrm{Prob}\{\theta - b/m \geq \epsilon - b/m\} \leq \exp(-2(\epsilon - b/m)^2 m)$
holds by Lemma 4.1.

□ 
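The bound of Lemma 4.2 is a one-line computation. The sketch below is a minimal illustration with a function name of our choosing. The example values of b and m are arbitrary.

```python
from math import exp


def hoeffding_bound(eps, b, m):
    """Upper bound exp(-2 (eps - b/m)^2 m) on Prob{theta >= eps}, valid for eps > b/m."""
    rate = b / m
    if eps <= rate:
        raise ValueError("the bound is stated only for eps > b/m")
    return exp(-2.0 * (eps - rate) ** 2 * m)


# Example: b = 2 misclassifications observed in m = 16 examples.
print(hoeffding_bound(0.2, 2, 16))
print(hoeffding_bound(0.3, 2, 16))
```

The bound tightens as eps moves away from the observed rate b/m or as m grows.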


Let $Z^+$ be the set of positive integers and let S(b,m) be
the set $\{(p,q) : p,q \in Z^+\}$ where, for all $b/m < \epsilon < 1$,
$I_\epsilon(p,q)$ is an Incomplete Beta distribution that is
consistent with the bound in Lemma 4.2. That is, S(b,m) is
the set of integer parameters for Incomplete Beta
distributions that are consistent with the bound in Lemma 4.2.
In the following we develop a characterization of S(b,m).

First, Lemmas 4.3, 4.4 and 4.5 develop necessary
conditions for a consistent (p,q) set. Suppose that there are
b misclassifications out of the training sample of size m.

Lemma 4.3: For $p,q \in Z^+$ and $b/m < \epsilon < 1$, where
$1 \leq b \leq m-1$,

$$(p,q) \in S(b,m)$$

only if $p + q \geq 1 + 4m(1-\epsilon)(\epsilon - b/m)$, where $\epsilon$ is the
root of the equation $-2(1-\epsilon)\ln(1-\epsilon) = \epsilon - b/m$.

Proof: By Lemma 4.2, any consistent Beta prior must satisfy

$$\mathrm{Prob}\{\theta \geq \epsilon\} = 1 - I_\epsilon(p,q) \leq \exp(-2(\epsilon - b/m)^2 m)$$

for all $b/m < \epsilon < 1$. Here $I_\epsilon(p,q)$ is the Incomplete Beta
which gives the probability of a value less than or equal to
$\epsilon$. Since p and q are integers, the above inequality can be
written with a Binomial distribution as follows:

$$\sum_{k=0}^{p-1} C_k^{p+q-1} \epsilon^k (1-\epsilon)^{p+q-1-k} \leq e^{-2(\epsilon - b/m)^2 m}.$$

By rearranging terms, the following inequality holds:

$$\sum_{k=0}^{p-1} C_k^{p+q-1} \epsilon^k \exp\big((p+q-1-k)\ln(1-\epsilon) + 2(\epsilon - b/m)^2 m\big) \leq 1.$$

Then, for k=0,

$$(p+q-1)\ln(1-\epsilon) + 2(\epsilon - b/m)^2 m \leq 0.$$

That is,

$$p+q-1 \geq -2(\epsilon - b/m)^2 m / \ln(1-\epsilon).$$

Let $f(\epsilon) = -2(\epsilon - b/m)^2 m / \ln(1-\epsilon)$.

Since $f(\epsilon)$ is continuous and concave on an interval where its
maximum occurs ($f''(\epsilon) < 0$ at the unique $\epsilon$ satisfying
$f'(\epsilon) = 0$ if $\ln(1-\epsilon) < 0.5$), by taking its first
derivative and setting it equal to 0, we get the $\epsilon$ giving the
maximum value of $f(\epsilon)$. I.e., $f'(\epsilon) = 0$ reduces to

$$-2(1-\epsilon)\ln(1-\epsilon) = \epsilon - b/m.$$

So,

$$p+q \geq 1 - 2(\epsilon - b/m)^2 m / \ln(1-\epsilon) = 1 + 4m(1-\epsilon)(\epsilon - b/m),$$

where $\epsilon$ is the root of the equation
$-2(1-\epsilon)\ln(1-\epsilon) = \epsilon - b/m$, is obtained as a necessary
condition.

□ 
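The root of -2(1-eps)ln(1-eps) = eps - b/m has no closed form, but it is easy to locate numerically. The following Python sketch uses plain bisection on the interval (b/m, 1), where the left side minus the right side is positive at b/m and negative near 1. This is a sketch under those assumptions, not a procedure given in the text, and the function names are hypothetical.

```python
from math import ceil, log


def stationarity_root(rate, tol=1e-12):
    """Root in (rate, 1) of -2(1-eps)ln(1-eps) = eps - rate, located by bisection."""
    def g(eps):
        return -2.0 * (1.0 - eps) * log(1.0 - eps) - (eps - rate)

    lo, hi = rate, 1.0 - 1e-15       # g(lo) > 0 and g(hi) < 0 for 0 < rate < 1
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def necessary_bound(b, m):
    """The quantity 4m(1-eps)(eps - b/m) evaluated at the root eps."""
    rate = b / m
    eps = stationarity_root(rate)
    return 4.0 * m * (1.0 - eps) * (eps - rate)


def smallest_integer_at_least(b, m):
    """Smallest integer not below the necessary bound."""
    return ceil(necessary_bound(b, m))


# Example: b = 8 misclassifications in a training sample of size m = 40.
print(stationarity_root(8 / 40), necessary_bound(8, 40))
```

Lemma 4.3 then requires p + q to be at least 1 plus the value returned by necessary_bound.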

Below,  Lemma  4.4  gives  a relationship  between  the 
consistent  Beta  priors  in  S(b,m). 

Lemma 4.4: (Tsai and Koehler 1991) Assume $p, q \in Z^+$. If

$$(p,q) \in S(b,m)$$

then

$$(t,q) \in S(b,m) \quad \text{for } 1 \leq t \leq p,$$
$$(p,t) \in S(b,m) \quad \text{for } t \geq q.$$

Lemma 4.5 gives a necessary condition for parameter q in
S(b,m).

Lemma 4.5: For $p,q \in Z^+$ and $b/m < \epsilon < 1$, where
$1 \leq b \leq m-1$,

$$(p,q) \in S(b,m)$$

only if $q \geq 4m(1-\epsilon)(\epsilon - b/m)$, where $\epsilon$ is the root of the
equation $-2(1-\epsilon)\ln(1-\epsilon) = \epsilon - b/m$.

Let $q^*$ be the minimum q satisfying the above condition.
Then $(1,q^*)$ is a consistent Beta prior.

Proof: Let p = 1. Then

$$\mathrm{Prob}\{\theta \geq \epsilon\} = 1 - I_\epsilon(1,q) = (1-\epsilon)^q \leq \exp(-2(\epsilon - b/m)^2 m).$$

By taking logarithms, we get

$$q \ln(1-\epsilon) \leq -2(\epsilon - b/m)^2 m.$$

By a similar reasoning as in the proof of Lemma 4.3, we get
$q \geq 4m(1-\epsilon)(\epsilon - b/m)$, where $\epsilon$ is the root of the equation
$-2(1-\epsilon)\ln(1-\epsilon) = \epsilon - b/m$.

Let $q^*$ be the minimum q satisfying the above condition.
Then $(1,q^*)$ is a consistent Beta prior since it satisfies the
necessary condition in Lemma 4.3 and the probability bound
holds for any $\epsilon$, where $b/m < \epsilon < 1$.

Suppose, by contradiction, $(1+k, q^*-k)$, where $k \in Z^+$, is
a consistent Beta prior. Note that this Beta prior satisfies
a necessary condition in Lemma 4.3. Then by Lemma 4.4,
$(1, q^*-k)$ is also a consistent Beta prior. So, $q^*$ is not the
minimum q, given p=1, consistent with the above bound. Hence we
reach a contradiction. So, $(1+k, q^*-k)$ is not a consistent
Beta prior for any $k \in Z^+$.

Since for any $q < q^*$, the minimal possible p is not
consistent, a consistent q should be greater than or equal to
$q^*$ for any $p \in Z^+$.

Hence, $(p,q) \in S(b,m)$
only if $q \geq 4m(1-\epsilon)(\epsilon - b/m)$,
where $\epsilon$ is the root of the equation
$-2(1-\epsilon)\ln(1-\epsilon) = \epsilon - b/m$.

□ 


Suppose (p,q) is a consistent Beta prior. Further
suppose there are $b_2$ misclassifications in a test sample of
size $m_2$. Then the posterior distribution will have parameters
$(p+b_2, q+m_2-b_2)$. The worst possible confidence factor,
$\delta$, is

$$\delta = \sup \; 1 - I_\epsilon(p+b_2, q+m_2-b_2)$$

subject to

$$(p,q) \in S(b,m)$$
$$p, q \in Z^+.$$

By Lemma 4.4, we know that the term

$$I_\epsilon(p+b_2, q+m_2-b_2)$$

decreases with increases in p and decreases in q. So, the
supremum may be attained at (p, q), where p is the maximal p
among all possible p, and q is the minimal q among all
possible q. However, as p increases, the consistent minimal
q increases (more strictly, it does not decrease) by Lemma
4.4. So, in most cases, we cannot obtain such an ideal
(maximal p, minimal q) set. Therefore, we need to get a more
detailed relationship for the consistent parameter set.

Below, Lemma 4.6 gives a relationship between a
consistent Beta prior (p,q) and (p',q'), where p' > p, and
q' > q. Here we will see that the decision of determining the
parameter set, giving the higher confidence factor, $\delta$, depends
on the value of $\epsilon$.

Lemma 4.6: Assume $k \in Z^+$. Then the following are true.

a) Suppose (p,q) and (p+k,q+1) are consistent Beta priors,
where $k \in Z^+$.

If $q/(p+q) > b/m$, then (p+k,q+1) gives a higher
confidence factor for $b/m < \epsilon < \alpha$, and (p,q) gives a higher
confidence factor for $\alpha < \epsilon < 1$, where $b/m < \alpha < 1$. As k
increases, $\alpha$ increases.

If $q/(p+q) \leq b/m$, and k=1, then (p,q) gives a higher
confidence factor than (p+1,q+1) for all $\epsilon$, $b/m < \epsilon < 1$.

In no case is $\alpha \geq 1$.

b) Suppose (p,q) and (p+1,q+k) are consistent Beta priors,
where $k \in Z^+$.

If $q/(p+q) > b/m$, then (p+1,q+k) gives a higher
confidence factor for $b/m < \epsilon < \beta$, and (p,q) gives a higher
confidence factor for $\beta < \epsilon < 1$, where $b/m < \beta < 1$. As k
increases, $\beta$ decreases.

For example, if k=2, p and q are large enough compared
to 1, and p/q is close to 1/2, then $\beta$ is close to 1/3. As p/q
increases, $\beta$ also decreases.

Proof: a) Suppose k=1. From Abramowitz and Segun (1968), by
combining formula 26.5.15 and 26.5.16, we get

$$g(k) = g(1) = I_\epsilon(p,q) - I_\epsilon(p+1,q+1) = K \left(1 - \frac{p+q}{q}\,\epsilon\right),$$

where $K = \epsilon^p (1-\epsilon)^q \, \Gamma(p+q) / (\Gamma(p+1)\Gamma(q))$.

Since K is positive, $g(1) = I_\epsilon(p,q) - I_\epsilon(p+1,q+1) > 0$ for
$b/m < \epsilon < q/(p+q)$, and $g(1) < 0$ for $q/(p+q) < \epsilon < 1$. So, if
$q/(p+q) > b/m$, then (p+1,q+1) gives a higher confidence factor
for $b/m < \epsilon < q/(p+q)$, and (p,q) gives a higher confidence
factor for $q/(p+q) < \epsilon < 1$. If $q/(p+q) \leq b/m$, then (p,q)
gives a higher confidence factor for all $\epsilon$, $b/m < \epsilon < 1$.

Now we can derive a general formula for g(k) by
substituting $I_\epsilon(p+(k-1),q+1)$ for $I_\epsilon(p+k,q+1)$ by using
formula 26.5.16 of Abramowitz and Segun.

$$g(k) = I_\epsilon(p,q) - I_\epsilon(p+k,q+1) = K \left(1 - \frac{p+q}{q}\,\epsilon + (1-\epsilon)R(k,\epsilon)\right),$$

where $R(k,\epsilon)$ is a sum of k-1 finite positive terms.

Note that as k increases by one, a new positive term is added
to $R(k,\epsilon)$ without changing existing terms. Since $R(k,\epsilon)$
becomes larger as k increases, $\epsilon$ needs to be larger to make
$g(k) < 0$. That is, as k increases, $\alpha$ increases.

By choosing $\epsilon$ close enough to 1, g(k) can be negative for any
$k \in Z^+$. For such an $\epsilon < 1$, (p,q) gives a higher confidence
factor than (p+k,q+1).

So, a range of $\epsilon$, where (p,q) gives a higher confidence
factor, is always nonempty for any $k \in Z^+$.

By Lemma 4.4, $1 - I_\epsilon(p+k,q+1)$ increases as k increases.
So, the range of $\epsilon$, where (p+k,q+1) gives a higher confidence
factor, is also always nonempty for any $k \in Z^+$ if
$q/(p+q) > b/m$.

b) In the proof of a) we showed that

$$h(k) = h(1) = I_\epsilon(p,q) - I_\epsilon(p+1,q+1) = K \left(1 - \frac{p+q}{q}\,\epsilon\right),$$

where $K = \epsilon^p (1-\epsilon)^q \, \Gamma(p+q) / (\Gamma(p+1)\Gamma(q))$.

From Peizer and Pratt (1968), we get

$$I_\epsilon(p,q+1) - I_\epsilon(p,q) = \epsilon^p (1-\epsilon)^q \, \Gamma(p+q) / (\Gamma(p)\Gamma(q+1)).$$

By substituting $I_\epsilon(p+1,q+(k-1))$ for $I_\epsilon(p+1,q+k)$ using the
above formula, we get a general formula for h(k).

$$h(k) = I_\epsilon(p,q) - I_\epsilon(p+1,q+k) = K \left(1 - \frac{p+q}{q}\,\epsilon - (1-\epsilon)L(k,\epsilon)\right),$$

where $L(k,\epsilon)$ is a sum of k-1 finite positive terms.

Note that as k increases by one, a new positive term is added
to $L(k,\epsilon)$ without changing existing terms. Since $L(k,\epsilon)$
becomes larger as k increases, $h(k) < 0$ for smaller $\epsilon$. So,
the interval of $\epsilon$, where (p,q) gives a higher confidence
factor, is extended as k increases.

For k=2,

$$h(2) = I_\epsilon(p,q) - I_\epsilon(p+1,q+2) = K \left(1 - \frac{p+q}{q}\,\epsilon - \frac{(p+q+1)(p+q)}{(q+1)q}\,(1-\epsilon)\epsilon\right).$$

If p/q = 1/2, and p, q are large enough, then for $\epsilon > 1/3$,
$h(2) < 0$. As p/q increases, (p+q)/q increases. Hence, $\beta$
decreases.

□


By Lemma 4.6 we know that there is no (p,q) which gives
the worst possible confidence factor, $\delta$, over all $\epsilon$,
$b/m < \epsilon < 1$, unless $q/(p+q) \leq b/m$. Since it is not practical
to search for all consistent Beta priors and compare them to
find a (p,q) giving the highest $\delta$ for each possible $\epsilon$, we may
consider a simple suboptimal solution strategy that gives a
near-optimal solution. For example, fixing p at a maximum
level first, and then finding a minimum q, given p, is one
strategy. Fixing q at a minimum level, and then finding a
maximum p, given q, can be another strategy.

Since  a maximum  possible  p is  a very  large  indefinite 
number,  we  would  be  better  off  to  fix  q at  the  minimum  level, 
and  then  find  a maximum  possible  p with  the  given  q. 

In  the  following  we  give  our  main  result  of  this 
subsection  based  on  the  above  strategy. 

Theorem 4.7: For $p,q \in Z^+$ and $b/m < \epsilon < 1$, where
$1 \leq b \leq m-1$,

$$(p,q) \in S(b,m)$$

only if $q \geq 4m(1-\epsilon)(\epsilon - b/m)$, where $\epsilon$ is the root of the
equation $-2(1-\epsilon)\ln(1-\epsilon) = \epsilon - b/m$.

Let $q^*$ be the minimum q satisfying the above condition,
and let $p^*$ be the maximum consistent p given $q^*$. Then
$(p^*,q^*)$ is a consistent Beta prior which gives the worst
possible confidence factor for at least a certain nonempty
interval of $\epsilon$.

After testing with $m_2$ samples obtaining $b_2$
misclassifications,

$$\mathrm{Prob}\{\theta \geq \epsilon\} \leq \delta, \quad \text{where } 1 - I_\epsilon(p^*+b_2, q^*+m_2-b_2) = \underline{\delta} \leq \delta.$$

That is, $(p^*,q^*)$ gives a lower bound for $\delta$.

Proof: Theorem 4.7 follows from Lemmas 4.3, 4.4, 4.5 and 4.6.

□ 


Consider an example of a training sample of size m = 40,
and b = 8 misclassifications found by the pruned tree.
Further suppose that we now take a sample of size $m_2 = 20$ and
get two ($b_2 = 2$) misclassifications using the pruned tree.

By Theorem 4.7 we need to find the minimum $q^*$ first.
Since b/m = 8/40 = 0.2, $\epsilon = .8189$ is the root of the equation
$-2(1-\epsilon)\ln(1-\epsilon) = \epsilon - 0.2$.

Hence, $q \geq 4 \cdot 40 (1 - .8189)(.8189 - .2) = 17.933$, and
$q^* = 18$.

We now need to find $p^*$ given $q^* = 18$. For p = 2, and
$\epsilon = 0.8189$,

$$\mathrm{Prob}\{\theta \geq \epsilon\} = 1 - I_\epsilon(2,18) = (1 - .8189)^{18} (1 + 18 \cdot .8189) = 6.911 \times 10^{-13}$$
$$> \exp(-2(\epsilon - b/m)^2 m) = \exp(-2(0.8189 - .2)^2 \cdot 40) = 4.919 \times 10^{-14}.$$

So, (p,q) = (2,18) is not a consistent Beta prior. Hence, by
Lemma 4.4, $p^* = 1$ given $q^* = 18$, and for $\epsilon = 0.2$,

$$\mathrm{Prob}\{\theta \geq 0.2\} \leq \delta,$$

where $1 - I_\epsilon(p^*+b_2, q^*+m_2-b_2) = 1 - I_{.2}(1+2, 18+20-2) = \underline{\delta} \leq \delta$.

Evaluating the left term gives

$$\mathrm{Prob}\{\theta \geq 0.2\} \leq \delta, \quad \text{where } 1 - I_{.2}(3,36) = .0113 = \underline{\delta} \leq \delta.$$




So the probability that the error of the pruned tree is
greater than or equal to 0.2 is less than $\delta$, where $\delta$ is
bounded below by 0.0113.

For small $\epsilon$, we may improve the above lower bound by
finding a consistent Beta prior for larger p values. For
example, if p=2, the smallest consistent q is 20. In this
case,

$$\underline{\delta} = 1 - I_\epsilon(p+b_2, q+m_2-b_2) = 1 - I_{.2}(2+2, 20+20-2) = .0244.$$

By continuing the search, we can improve the lower bound for
smaller $\epsilon$. For larger $\epsilon$, however, such as $\epsilon > 3/4$,
(p,q) = (2,20) does not improve the bound.

For $\epsilon = 0.8$, $\mathrm{Prob}\{\theta \geq 0.8\} \leq \delta$, where
$\underline{\delta} = 1 - I_{.8}(3,36) = .0000 \leq \delta$.
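For integer parameters the quantities in this illustration reduce to finite binomial sums, so they can be recomputed without an Incomplete Beta routine. The Python sketch below is an illustration with hypothetical function names. It checks the Lemma 4.2 bound at a single value of eps only, which can refute consistency of a prior but cannot establish it for all eps.

```python
from math import comb, exp


def beta_upper_tail(eps, p, q):
    """1 - I_eps(p, q) for positive integers p and q, written as a binomial sum."""
    n = p + q - 1
    return sum(comb(n, k) * eps ** k * (1 - eps) ** (n - k) for k in range(p))


def violates_bound_at(eps, p, q, b, m):
    """True if the Beta(p, q) prior exceeds the Lemma 4.2 bound at this single eps."""
    return beta_upper_tail(eps, p, q) > exp(-2.0 * (eps - b / m) ** 2 * m)


def posterior_lower_bound(eps, p, q, b2, m2):
    """1 - I_eps(p + b2, q + m2 - b2): the tail of the posterior after the test sample."""
    return beta_upper_tail(eps, p + b2, q + m2 - b2)


# Training sample: m = 40, b = 8. Test sample: m2 = 20, b2 = 2.
print(violates_bound_at(0.8189, 2, 18, 8, 40))
print(posterior_lower_bound(0.2, 1, 18, 2, 20))
print(posterior_lower_bound(0.2, 2, 20, 2, 20))
```

A search for the maximum consistent p given q would repeat the single-point check over a fine grid of eps in (b/m, 1), which is an approximation of the "for all eps" condition.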

As we have seen in the above illustration, the range of
$\epsilon$, where (p,q) gives a higher confidence factor, depends on
many parameters, such as the increase in q by the unit increase
in p (i.e., "k"), p, q, the p/q ratio, b, m, $b_2$, and $m_2$.

An upper bound of $\delta$ will be given in the next subsection.

4.2.3  A General  Error  Bound 

In this subsection we develop a general bound on

$$\mathrm{Prob}\{\theta \geq \epsilon\}$$

that requires no assumption on the prior distribution or on
the domain of interest. This bound also ignores any
information that may have been obtained during training. With
a pruned decision tree, training information is not often
useful since we implicitly assume that there are many chance
occurrences on the training sample. So the bounds we now
develop are appropriate especially when the decision tree is
pruned or when the training domain is different from the
testing domain.

Suppose there are b misclassifications out of m
independent examples using the pruned decision tree. Here b
and m correspond to $b_2$ and $m_2$ of previous subsections,
respectively.

Theorem 4.8: (Bound on the Posterior Estimate of the
Confidence Parameter)

For any $b/m < \epsilon$,

$$\mathrm{Prob}\{\theta \geq \epsilon\} \leq \exp(-2(\epsilon - b/m)^2 m).$$

Proof: Theorem 4.8 follows from Lemma 4.2.

□ 


A direct  application  of  Theorem  4.8  gives  the  following. 

Corollary 4.9: (Confidence Estimate for an Error Range)

For any $b/m < \epsilon$ and $0 < \epsilon' < b/m$,

$$\mathrm{Prob}\{\epsilon' < \theta < \epsilon\} \geq 1 - \exp(-2(b/m - \epsilon)^2 m) - \exp(-2(b/m - \epsilon')^2 m).$$
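The two-sided estimate of Corollary 4.9 subtracts one Hoeffding tail on each side of the observed rate. The sketch below is an illustration. The function name and the example values b = 20, m = 200 are ours and are not taken from the text.

```python
from math import exp


def error_range_confidence(eps_low, eps_high, b, m):
    """Lower bound on Prob{eps_low < theta < eps_high}, for eps_low < b/m < eps_high."""
    rate = b / m
    if not (0.0 < eps_low < rate < eps_high):
        raise ValueError("require 0 < eps_low < b/m < eps_high")
    upper_tail = exp(-2.0 * (rate - eps_high) ** 2 * m)
    lower_tail = exp(-2.0 * (rate - eps_low) ** 2 * m)
    return 1.0 - upper_tail - lower_tail


# Hypothetical example: 20 misclassifications in 200 test examples.
print(error_range_confidence(0.05, 0.15, 20, 200))
```

For small m or a narrow range the returned value can be negative, in which case the corollary gives no information.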




Instead of Hoeffding's inequality, we may use an
improved, but more complicated, bound from Johnson and Kotz
(1969).


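Lemma 4.10: (Johnson and Kotz 1969)

Let $x_1, x_2, \ldots, x_n$ be independent random variables with
$0 \leq x_i \leq 1$ and $E[x_i] = \mu$, for $i = 1, 2, \ldots, n$, and let

$$\bar{x} = \Big(\sum_{i=1}^{n} x_i\Big) / n.$$

Then,

$$\mathrm{Prob}\{\bar{x} - \mu \geq c\} \leq \exp\left(-2nc^2 - (4/3)nc^4\right)$$

or

$$\mathrm{Prob}\{\bar{x} - \mu \geq c\} \leq \exp\left(-nc^2/(2\mu(1-\mu)) - (4/9)nc^4\right).$$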

The  following  result  can  be  directly  derived  from  Lemma 
4.10  and  Theorem  4.8. 

Theorem 4.11: (Improved Bounds on the Posterior Estimate of
the Confidence Parameter)

For any $b/m < \epsilon$,

$$\mathrm{Prob}\{\theta \geq \epsilon\} \leq \exp\left(-2(\epsilon - b/m)^2 m - (4/3)(\epsilon - b/m)^4 m\right)$$

or

$$\mathrm{Prob}\{\theta \geq \epsilon\} \leq \exp\left(-(\epsilon - b/m)^2 m/(2\theta(1-\theta)) - (4/9)(\epsilon - b/m)^4 m\right).$$

We can apply the above results to estimate the error of
any pruned decision tree.




Suppose we have 16 test examples and two of them are
misclassified in the pruned decision tree. Under the
assumption of a uniform prior, Theorem 4.0 gives

$$\mathrm{Prob}\{\theta \geq 0.2 \mid 2, 16\} = \sum_{k=0}^{2} C_k^{17} (0.2)^k (1-0.2)^{17-k} = 0.3096$$

and

$$\mathrm{Prob}\{\theta \geq 0.3 \mid 2, 16\} = \sum_{k=0}^{2} C_k^{17} (0.3)^k (1-0.3)^{17-k} = 0.07739.$$

The probability that the error of the learned concept with
pruning is greater than 20% and 30% is .3096 and 0.07739,
respectively.

A general bound with no assumption on the prior
distribution is given by Theorem 4.8 and Theorem 4.11.
Theorem 4.8 gives

$$\mathrm{Prob}\{\theta \geq 0.2\} \leq \exp(-2(0.2 - 2/16)^2 \cdot 16) = .83527$$

and

$$\mathrm{Prob}\{\theta \geq 0.3\} \leq \exp(-2(0.3 - 2/16)^2 \cdot 16) = .37531.$$

Theorem 4.11 gives

$$\mathrm{Prob}\{\theta \geq 0.2\} \leq \exp(-2(0.2 - 2/16)^2 \cdot 16 - (4/3)(0.2 - 2/16)^4 \cdot 16) = .8347$$

and

$$\mathrm{Prob}\{\theta \geq 0.3\} \leq \exp(-2(0.3 - 2/16)^2 \cdot 16 - (4/3)(0.3 - 2/16)^4 \cdot 16) = .3678.$$
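The two distribution-free bounds used in this example differ only in the fourth-power term. A minimal Python sketch that evaluates both side by side follows. The function names are ours, and only the first form of Theorem 4.11 is implemented because the second form depends on the unknown theta.

```python
from math import exp


def hoeffding_bound(eps, b, m):
    """Theorem 4.8: exp(-2 c^2 m) with c = eps - b/m > 0."""
    c = eps - b / m
    return exp(-2.0 * m * c ** 2)


def improved_bound(eps, b, m):
    """First form of Theorem 4.11: exp(-2 c^2 m - (4/3) c^4 m) with c = eps - b/m > 0."""
    c = eps - b / m
    return exp(-2.0 * m * c ** 2 - (4.0 / 3.0) * m * c ** 4)


# Two misclassifications among 16 test examples.
for eps in (0.2, 0.3):
    print(eps, hoeffding_bound(eps, 2, 16), improved_bound(eps, 2, 16))
```

Because the extra term is of order c^4, the improvement is small when eps is close to b/m and becomes noticeable only for larger gaps.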




4.3  An  Application 

In Section 4.2 we gave a lower bound for the worst
possible confidence factor, $\delta$, and a general upper bound for
$\delta$. In this section we combine these results to obtain a range
in which $\delta$ lies.

Suppose a training sample of size 20 is used in building
a decision tree, and then the decision tree is pruned by a
pruning technique, and gives 6 misclassifications. Further
suppose the pruned decision tree is tested by 16 independent
examples, and two misclassifications are obtained.

Since b/m = 6/20 = 0.3, $\epsilon = .8568$ is the root of the
equation $-2(1-\epsilon)\ln(1-\epsilon) = \epsilon - 0.3$.

Hence, $q \geq 4 \cdot 20 (1 - .8568)(.8568 - .3) = 6.42$, and by
Theorem 4.7, $q^* = 7$.

We now need to find $p^*$ given $q^* = 7$. For p = 2, and
$\epsilon = 0.8568$,

$$\mathrm{Prob}\{\theta \geq \epsilon\} = 1 - I_\epsilon(2,7) = (1 - .8568)^7 (1 + 7 \cdot .8568) = 0.0000084$$
$$> \exp(-2(\epsilon - b/m)^2 m) = \exp(-2(0.8568 - .3)^2 \cdot 20) = .0000041.$$

So, (p,q) = (2,7) is not a consistent Beta prior. Hence by
Lemma 4.4, $p^* = 1$ given $q^* = 7$, and for $\epsilon = 0.3$,

$$\mathrm{Prob}\{\theta \geq 0.3\} \leq \delta,$$

where $1 - I_\epsilon(p^*+b_2, q^*+m_2-b_2) = \underline{\delta} = 1 - I_{.3}(1+2, 7+16-2) = .0157 \leq \delta$.




We now give an upper bound for $\delta$. Since Theorems 4.8 and
4.11 can be applied in any situation with independent random
variables, $x_i$, these two theorems can give an upper bound for
$\delta$.

By Theorem 4.8,

$$\mathrm{Prob}\{\theta \geq 0.3\} \leq \exp(-2(0.3 - 2/16)^2 \cdot 16) = .37531,$$

and by Theorem 4.11,

$$\mathrm{Prob}\{\theta \geq 0.3\} \leq \exp(-2(0.3 - 2/16)^2 \cdot 16 - (4/3)(0.3 - 2/16)^4 \cdot 16) = .3678.$$

So a range of $\delta$ is obtained, $.0157 \leq \delta \leq .3678$.

That is, the worst possible probability that the learned
concept with pruning has an error greater than 30% could be as
high as .3678, but not less than .0157. In this case, since
$\epsilon$ is small, we can increase the lower bound by finding
consistent Beta priors for higher p.
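The range for the confidence factor combines a Beta posterior tail (lower end) with the first bound of Theorem 4.11 (upper end). The sketch below recomputes both ends for this application. It assumes the prior parameters (1, 7) found above as inputs, and the function names are hypothetical.

```python
from math import comb, exp


def beta_upper_tail(eps, p, q):
    """1 - I_eps(p, q) for positive integers p and q, written as a binomial sum."""
    n = p + q - 1
    return sum(comb(n, k) * eps ** k * (1 - eps) ** (n - k) for k in range(p))


def confidence_range(eps, p_star, q_star, b2, m2):
    """Return (lower, upper) ends of the range for the worst possible confidence factor."""
    lower = beta_upper_tail(eps, p_star + b2, q_star + m2 - b2)
    c = eps - b2 / m2
    upper = exp(-2.0 * m2 * c ** 2 - (4.0 / 3.0) * m2 * c ** 4)
    return lower, upper


# Prior (p*, q*) = (1, 7) from the training sample; test sample m2 = 16, b2 = 2.
print(confidence_range(0.3, 1, 7, 2, 16))
```

Replacing (1, 7) by another consistent prior with a larger p changes only the lower end of the range.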


CHAPTER  5 

AN INVESTIGATION ON THE CONDITIONS OF PRUNING

5.1 Introduction

Pruning  leads  to  concept  simplification.  When  does 
pruning  lead  to  better  predictions?  In  this  chapter  we 
develop  conditions  under  which  pruning  is  necessary  to  obtain 
better  prediction  accuracy.  The  analysis  of  this  chapter 
implicitly assumes that there may be some hidden attributes
(Spangler et al., 1989, called this the case of
"inconclusive data"). That is, the current attribute set does
not entirely determine its classification. Schaffer (1991)
gives a result on the conditions under which overfitting (e.g.,
an unpruned tree) does not decrease predictive accuracy. We
give a different view of overfitting and generalize his
result. We then apply this result to our specific pruning
situation. Here we consider prediction accuracy of true class
as the measure of the merit of pruning. In Section 5.2 we
give and analyze the fundamental situation where pruning
occurs. Following Schaffer, in Section 5.3 we perform a
Bayesian analysis for samples of size three to find the
conditions under which pruning increases prediction accuracy
as well as yielding concept simplicity. Finally in Section
5.4 we give a generalization of the results in Section 5.3 for
larger sample sets.


5.2 Fundamental Situation of Pruning

In an empirical study, Quinlan (1987a) found that pruning
increases the accuracy of the learned concept. Mingers'
(1989) empirical study showed that pruning improves the
accuracy by 20% to 25%. Schaffer (1991) theoretically
investigated pruning for very small sample sizes. We first
summarize his results in a decision tree perspective.

Consider the simplest case of a decision tree, where a
tree is a binary tree having only one node. We assume that
there is no measurement noise in the instance space. We
define measurement noise as the error occurring when we
measure the values of attributes and classifications. We also
assume an equally likely sampling distribution with
replacement.

Schaffer's Fundamental Situation:

There are 4 possible decision trees with at most one
node. The values of the attribute are 0 and 1. P and N
denote class labels.

tree #1       tree #2       tree #3       tree #4

    #             #
 0 / \ 1       0 / \ 1         P             N
  P   N         N   P

Suppose the true probabilities of class P for attribute values
0 and 1 are $p_0$ and $p_1$. Then the errors of each tree are,
respectively,

$e_1 = (1 - p_0 + p_1)/2$ for tree #1,
$e_2 = (p_0 + 1 - p_1)/2$ for tree #2,
$e_3 = (1 - p_0 + 1 - p_1)/2$ for tree #3, and
$e_4 = (p_0 + p_1)/2$ for tree #4.

Without loss of generality, consider a situation where we
have to decide whether to prune tree #1 or not. By comparing
the errors of each pruned and unpruned decision tree we can
conclude the following. Note that $e_1$ is greater than $e_3$
when $p_1 > 0.5$, and that $e_1$ is greater than $e_4$ when
$p_0 < 0.5$.

1. If $p_0 < 0.5$ and $p_1 > 0.5$ then pruning tree #1
increases its prediction accuracy since $e_1$ is greater than
$e_3$ and $e_4$. In this case, pruning increases the prediction
accuracy by removing a less reliable branch. This branch
reflects chance occurrences rather than representing a true
underlying relationship.

2. If $p_0 < 0.5$ and $p_1 < 0.5$ then the prediction accuracy
depends on the label of the pruned tree. For example, if the
label of the pruned tree is N for the unpruned tree #1, then
pruning increases the prediction accuracy. Otherwise pruning
decreases the prediction accuracy.

3. If $p_0 > 0.5$ and $p_1 > 0.5$ then the prediction accuracy
depends on the label of the pruned tree. For example, if the
label of the pruned tree is P for the unpruned tree #1, then
pruning increases the prediction accuracy.

4. If $p_0 > 0.5$ and $p_1 < 0.5$ then the unpruned tree #1 has
a larger prediction accuracy regardless of the label of the
pruned tree since $e_1$ is less than $e_3$ and $e_4$. In this
case, the branch reflects the true underlying relationship.
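
The four cases can be checked numerically. The following is a minimal Python sketch, not part of Schaffer's analysis; the function names `tree_errors` and `pruning_tree1_helps` are hypothetical and chosen only for illustration. It evaluates the four error expressions for given class probabilities and reports whether replacing tree #1 by a single leaf lowers the error.

```python
def tree_errors(p0, p1):
    """Return the errors (e1, e2, e3, e4) of the four one-node trees.

    p0, p1: true probabilities of class P for attribute values 0 and 1,
    with both attribute values assumed equally likely.
    """
    e1 = (1 - p0 + p1) / 2        # tree #1: 0 -> P, 1 -> N
    e2 = (p0 + 1 - p1) / 2        # tree #2: 0 -> N, 1 -> P
    e3 = (1 - p0 + 1 - p1) / 2    # tree #3: always P
    e4 = (p0 + p1) / 2            # tree #4: always N
    return e1, e2, e3, e4


def pruning_tree1_helps(p0, p1, pruned_label):
    """True if pruning tree #1 to a leaf labeled P or N lowers the error."""
    e1, _, e3, e4 = tree_errors(p0, p1)
    pruned_error = e3 if pruned_label == "P" else e4
    return pruned_error < e1
```

For instance, with the illustrative values p0 = 0.3 and p1 = 0.4 (case 2), pruning to the label N helps while pruning to the label P does not.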

As we see in the above result, pruning may increase the
prediction accuracy by removing branches constructed from
examples occurring by chance (e.g., the case of $p_0 < 0.5$ and
$p_1 > 0.5$). Note that the decision will be reversed if we
have to decide whether to prune tree #2 or not. The above
analysis, however, does not use any information from the
training sample or any prior observations. In the next
section, we summarize certain basic results using a
Bayesian analysis.

5.3  A Bayesian Analysis on the Conditions Where Pruning is Useful

Schaffer (1991) gives a result for the above example
using Bayesian methods. Consider a training sample of size
three.

Sample #1: { (0,P), (0,P), (1,N) }

In this situation we have a choice between a simple
hypothesis (tree #3) having error less than 50% over the
training sample and a statistically less reliable, but better
fitting complex hypothesis (tree #1). Schaffer calls this
case an equivocal case. He investigates the conditions for
equivocal cases where the best fitting unpruned decision tree
leads to a decrease in the prediction performance over the
best fitting pruned decision tree.

We  begin  with  a precise  definition  of  an  equivocal  case. 

Definition  5.0:  (An  equivocal  case)  Let  n be  the  size  of  the 
training  sample.  Let  q be  the  number  of  examples  of  the 
larger  class  in  the  sample.  If  q is  less  than  n and  strictly 
greater  than  n/2,  then  a q:n-q  partitioned  sample  of  size  n is 
called  an  equivocal  case. 
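
Definition 5.0 can be stated as a one-line predicate. The sketch below is illustrative only; the names `is_equivocal` and `equivocal_partitions` are hypothetical and do not appear in the text.

```python
def is_equivocal(q, n):
    """True if a q:(n-q) partitioned sample of size n is an equivocal case.

    q is the number of examples of the larger class in the sample.
    """
    return n / 2 < q < n


def equivocal_partitions(n):
    """List all q:(n-q) partitions of a sample of size n that are equivocal."""
    return [(q, n - q) for q in range(1, n + 1) if is_equivocal(q, n)]
```

For a sample of size three the only equivocal partition is 2:1; for size six the 4:2 and 5:1 partitions are equivocal, while 3:3 and 6:0 are not.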

Let $S_p$ be the strategy of choosing the best fitting
pruned tree in an equivocal case and $S_u$ be the strategy of
choosing the best fitting unpruned tree.

5.3.1  A Bayesian  Analysis 

In the following, we restrict our attention to cases
involving samples of size three. Suppose the prior joint
distribution for $(p_0, p_1)$ is uniform on the unit square.
With this assumption and the assumption of a uniform
distribution over the instance space, a Bayesian analysis gives
the average prediction accuracy of the decision trees chosen by
$S_p$ and $S_u$ for equivocal training sets. Details of the
calculation process are as follows.

Since $S_p$ and $S_u$ ignore the order of sampling, there are
just four possible equivocal training sets of size three.

$S_1$ = { (0,P), (0,P), (1,N) },
$S_2$ = { (0,N), (0,N), (1,P) },
$S_3$ = { (0,P), (1,N), (1,N) }, and
$S_4$ = { (0,N), (1,P), (1,P) }.

Let $k_1 = P\{(0,P)\}^2 P\{(1,N)\}$,
    $k_2 = P\{(0,N)\}^2 P\{(1,P)\}$,
    $k_3 = P\{(1,N)\}^2 P\{(0,P)\}$, and
    $k_4 = P\{(1,P)\}^2 P\{(0,N)\}$.

Here $P\{(0,P)\} = p_0/2$,
     $P\{(0,N)\} = (1-p_0)/2$,
     $P\{(1,P)\} = p_1/2$, and
     $P\{(1,N)\} = (1-p_1)/2$

since we assumed $P\{(0,\cdot)\} = 0.5$. Then the probabilities
of each sample over the space of equivocal training sets are

$P(S_i) = k_i / K$ for i = 1,2,3,4, where $K = k_1 + k_2 + k_3 + k_4$.

Let $A(T_i)$ be the accuracy of tree #i, where i = 1,2,3,4,
as measured by the probability of its predicting the true
classification. Then

$A(T_1) = (p_0 + (1-p_1))/2$,
$A(T_2) = ((1-p_0) + p_1)/2$,
$A(T_3) = (p_0 + p_1)/2$, and
$A(T_4) = 1 - (p_0 + p_1)/2$.

Let $A(S_p)$ and $A(S_u)$ be the average accuracy of $S_p$ and
$S_u$. Since $S_u$ chooses $T_1$ given $S_1$ or $S_3$ and $T_2$
given $S_2$ or $S_4$,

$A(S_u) = (P(S_1) + P(S_3)) A(T_1) + (P(S_2) + P(S_4)) A(T_2)$.


Similarly, since $S_p$ chooses $T_3$ given $S_1$ or $S_4$ and
$T_4$ given $S_2$ or $S_3$,

$A(S_p) = (P(S_1) + P(S_4)) A(T_3) + (P(S_2) + P(S_3)) A(T_4)$.

Then the difference of expected prediction accuracy is

$$
\begin{aligned}
E[A(S_p) - A(S_u)]
&= \int_0^1 \int_0^1 (A(S_p) - A(S_u)) \, f(p_0,p_1) \, dp_0 \, dp_1 \\
&= (1/128) \int_0^1 \int_0^1 \left[ ((p_0)^2 (1-p_1) - (1-p_0)^2 (p_1)) (2p_1 - 1)
   + ((p_1)^2 (1-p_0) - (1-p_1)^2 (p_0)) (2p_0 - 1) \right] dp_0 \, dp_1 \\
&= -10/288 < 0
\end{aligned}
$$

since $f(p_0,p_1) = 1$ over the unit square.

This analysis shows that $S_u$ yields decision trees with a
higher average prediction accuracy than $S_p$ (Schaffer 1991).
That is, under the conditions described, pruning does not
increase the prediction accuracy.

As we see in the above analysis, the result depends on
the assumed prior distribution for $p_0$ and $p_1$.

5.3.2  Disguised  Bayesian  Analysis  Without  Noise 

In this subsection we assume that values for $p_0$ and $p_1$
are fixed, though unknown to us. We may calculate the
performance of decision trees selected by the two strategies
for each equivocal observation sequence and average these
performance figures, weighted by the chance of various
observation sequences arising under the assumed values of
$p_0$ and $p_1$. Schaffer (1991) called this type of analysis a
"Disguised" Bayesian Analysis.


Here we use $A(S_p)$ and $A(S_u)$ as defined in Subsection
5.3.1. We calculate $A(S_p)$ and $A(S_u)$ for each assumed pair
of values of $p_0$ and $p_1$.

We say that $S_p$ is preferable if $A(S_p)$ is greater than
$A(S_u)$, and vice versa. By performing calculations for all
possible pairs of $p_0$ and $p_1$ values, we get the following
results.

$S_p$ is preferable for points near the $p_0 = p_1$ line in the
regions $p_0 > 0.5$ and $p_1 > 0.5$, or $p_0 < 0.5$ and
$p_1 < 0.5$. But the acceptable range grows as the parameters
approach 0 or 1. The $S_p$ preferred regions are much smaller
than those in which $S_u$ is preferable (Schaffer 1991).
Approximate calculation shows that the $S_u$ area is 65% of the
total region.

Note that Schaffer does not consider the quality of the
decision (i.e., the degree to which the prediction accuracies
for the two strategies differ at a given $(p_0, p_1)$ point). He
also does not consider the distribution of $p_0$ and $p_1$.
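
The disguised Bayesian calculation for samples of size three can be sketched in Python as follows. This is an illustration under stated assumptions, not Schaffer's own procedure: the function names are hypothetical, and the choice of evaluating cell midpoints on a regular grid over the unit square is an assumption made here. The code uses the quantities $k_i$ and $A(T_i)$ of Subsection 5.3.1; since K is positive inside the unit square, the sign of $K (A(S_p) - A(S_u))$ decides which strategy is preferable.

```python
def accuracy_gap_size3(p0, p1):
    """Return K * (A(S_p) - A(S_u)) for noise-free samples of size three."""
    # Probabilities of the four observations (attribute values equally likely).
    p0P, p0N = p0 / 2, (1 - p0) / 2
    p1P, p1N = p1 / 2, (1 - p1) / 2
    # Unnormalized probabilities of the four equivocal training sets.
    k1 = p0P ** 2 * p1N
    k2 = p0N ** 2 * p1P
    k3 = p1N ** 2 * p0P
    k4 = p1P ** 2 * p0N
    # Accuracies of the four trees.
    a1 = (p0 + 1 - p1) / 2
    a2 = (1 - p0 + p1) / 2
    a3 = (p0 + p1) / 2
    a4 = 1 - (p0 + p1) / 2
    a_su = (k1 + k3) * a1 + (k2 + k4) * a2   # S_u picks T1 or T2
    a_sp = (k1 + k4) * a3 + (k2 + k3) * a4   # S_p picks T3 or T4
    return a_sp - a_su


def sp_preferred_fraction(grid=100):
    """Fraction of grid-cell midpoints in the unit square where S_p wins."""
    wins = 0
    for i in range(grid):
        for j in range(grid):
            p0 = (i + 0.5) / grid
            p1 = (j + 0.5) / grid
            if accuracy_gap_size3(p0, p1) > 0:
                wins += 1
    return wins / grid ** 2
```

Calling `sp_preferred_fraction()` gives an approximate size of the $S_p$ preferred region that can be compared with the figures quoted above; the result depends on the grid resolution.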

5.3.3  Disguised  Bayesian  Analysis  With  Noise 

For more realistic learning environments, we may consider
the possibility of erroneous observations and
misclassifications.

Let $e_d$ be the level of noise in the attributes (i.e., the
description noise), the probability that the true value of an
attribute is changed to some other value. Let $e_c$ be the level
of noise in the classification (i.e., the classification
noise), the probability that the true classification is
changed to some other classification. Then the probabilities
of making the four observations change as follows. Note that
each term in the following equations comes from the four
possible combinations of noise, i.e., the correct observation,
the observation with classification noise only, the
observation with description noise only, and the observation
with description and classification noise.

$P\{(0,P)\} = [(1-e_d)(1-e_c) p_0 + (1-e_d) e_c (1-p_0) + e_d (1-e_c) p_1 + e_d e_c (1-p_1)]/2$.

$P\{(0,N)\} = [(1-e_d)(1-e_c)(1-p_0) + (1-e_d) e_c p_0 + e_d (1-e_c)(1-p_1) + e_d e_c p_1]/2$.

$P\{(1,P)\} = [(1-e_d)(1-e_c) p_1 + (1-e_d) e_c (1-p_1) + e_d (1-e_c) p_0 + e_d e_c (1-p_0)]/2$.

$P\{(1,N)\} = [(1-e_d)(1-e_c)(1-p_1) + (1-e_d) e_c p_1 + e_d (1-e_c)(1-p_0) + e_d e_c p_0]/2$.
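
The four noisy observation probabilities share one pattern, which the following Python sketch makes explicit. The function name `noisy_observation_probs` and the helper `mix` are hypothetical names introduced here for illustration; the arithmetic follows the four equations above term by term.

```python
def noisy_observation_probs(p0, p1, ed=0.0, ec=0.0):
    """Return the probabilities of observing (0,P), (0,N), (1,P), (1,N).

    ed: description noise level, ec: classification noise level.
    """
    def mix(a, b):
        # a: probability of the observed class at the observed attribute value,
        # b: probability of the observed class at the other attribute value.
        return ((1 - ed) * (1 - ec) * a        # correct observation
                + (1 - ed) * ec * (1 - a)      # classification noise only
                + ed * (1 - ec) * b            # description noise only
                + ed * ec * (1 - b))           # both kinds of noise

    return {
        (0, "P"): mix(p0, p1) / 2,
        (0, "N"): mix(1 - p0, 1 - p1) / 2,
        (1, "P"): mix(p1, p0) / 2,
        (1, "N"): mix(1 - p1, 1 - p0) / 2,
    }
```

With both noise levels set to zero the function returns the noise-free probabilities of Subsection 5.3.1, and for any noise levels the four probabilities sum to one.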

Using an analysis procedure similar to that of the previous
subsection, Schaffer showed the following:

1) The classification noise has a nearly negligible effect on
the relative merit of $S_p$ and $S_u$.

2) The description noise has a strong effect, however, even at
moderate levels (Schaffer 1991).

However, Schaffer does not investigate the case of
$e_c > 0.5$ or $e_d > 0.5$. We have extended the analysis to
this case by using larger error values in Schaffer's formula.
The following was found.


3) If the classification noise is greater than 0.5, then it
has a strong effect on the relative merit of $S_p$ and $S_u$.
The $S_p$ preferred region is close to 65% regardless of the
level of classification noise.

4) If the description noise is greater than 0.5, then the $S_p$
preferred region is close to 100% regardless of the level of
description noise.

This result is consistent with Quinlan's (1986) empirical
results at lower noise levels. Quinlan's result shows that
the description noise has a strong effect at lower noise
levels. However, the classification noise is more dangerous
at higher noise levels (the case of $e_c > 0.7$).

5.4  A Generalization on the Conditions Where Pruning is Useful

In  this  section  we  generalize  Schaffer's  results  for 
larger  training  sets.  Note  that  pruning  techniques  are  often 
applied  to  statistically  unreliable  branches.  This 
observation  leads  to  the  following  conjecture.  Our  conjecture 
is that the pruning decision (i.e., determining the relative
merit of $S_p$ over $S_u$) depends on the number of examples on
each branch and on the ratio of the number of examples of the
larger class to the size of the training set. We call this
ratio the "skewness of the training set", and we begin with a
precise definition of the skewness.

Definition 5.1: (The Skewness of the Training Set) Let n be
the size of the training sample. Let q be the number of
examples of the larger class in the sample. Then the ratio
q/n is called the skewness of the training set.


If the skewness of the training set is large, then the
branch of the smaller class is based on a small number of
examples, and these may be considered chance occurrences rather
than a true underlying relationship. By removing these chance
occurrences, pruning may improve the prediction accuracy over
the instance space.

Consider sample sets $S_1$, $S_2$, $S_3$, and $S_4$ given in
Subsection 5.3.1.

$S_1$ = { (0,P), (0,P), (1,N) },
$S_2$ = { (0,N), (0,N), (1,P) },
$S_3$ = { (0,P), (1,N), (1,N) }, and
$S_4$ = { (0,N), (1,P), (1,P) }.

The number of examples of the larger class is two for each
training set and the sample size is three. In this case, the
skewness of the training set is 2/3 = 0.667.

In  the  following  we  assume  that  there  exists  a consistent 
decision  tree,  of  given  attributes,  for  the  training  set. 


5.4.1  Varying the Skewness of the Training Set

In this subsection we increase the skewness of the
training sets to investigate the effect of an unbalanced
sampling. Consider training sets of size four. We assume
there is no noise. Since $S_p$ and $S_u$ ignore the order of
sampling, there are just four equivocal training sets of size
four. They are

$S_1$ = { (0,P), (0,P), (0,P), (1,N) },
$S_2$ = { (0,N), (0,N), (0,N), (1,P) },
$S_3$ = { (0,P), (1,N), (1,N), (1,N) }, and
$S_4$ = { (0,N), (1,P), (1,P), (1,P) }.

In this case the skewness of each training set is 3/4 = 0.750.

In the case of samples of size five, there are eight
possible equivocal training sets. Four of them are 4:1
partitions (i.e., four samples are in one class and only one
is in the other class), and the remaining four are 3:2
partitions. If we consider only the 4:1 partitions, then the
skewness is 4/5 = 0.800. We refer to this case as "pure
skewness". If we consider all possible equivocal training
sets of size five, then the skewness can be calculated in at
least two different ways. One is the simple average of the
pure skewnesses. The other is a weighted average of
skewnesses based on the number of permutations of each
training set. We refer to these cases as "composite
skewness". Thus, for training sets of size five we have

1) simple average = (4/5 + 3/5) / 2 = .700, and

2) weighted average =
$(C_1^5 \cdot 4/5 + (C_1^4 + C_2^4) \cdot 3/5) / (C_1^5 + C_1^4 + C_2^4) = .667$.


Here $C_k^m$ is the number of ways of sampling k out of m
things. The number of permutations of each training set is
calculated as follows. Here we must consider the order of
sampling. Suppose we have a q:r partitioned training set,
where q and r are positive integers. Counting the number of
equivalent training sets can be considered as a two step
process.

Step 1. Choose t places out of q+1 places.
Step 2. Spread r indistinguishable examples to t
distinguishable places with an onto function.

Since t varies from 1 to r, the number of permutations is

$\sum_{t=1}^{r} C_t^{q+1} C_{t-1}^{r-1}$.


Let Sample #2 = { (0,P), (0,P), (0,P), (1,N), (1,N) },
    Sample #3 = { (0,P), (0,P), (1,N), (1,N), (0,P) }, and
    Sample #4 = { (0,P), (0,P), (1,N), (0,P), (1,N) }.

All of the above are equivalent training sets if we ignore the
order of sampling.

For the above case, q = 3 and r = 2. So, the total number of
ordered samples corresponding to this equivocal training set
is $C_1^4 + C_2^4 = 10$.
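
The two step count and the weighted average composite skewness can be reproduced with a short Python sketch. The function names are hypothetical. The observation that the two step sum equals the binomial coefficient for choosing r positions out of q+r (Vandermonde's identity) is a standard fact added here as a cross-check; it is not stated in the text.

```python
from math import comb


def ordered_sample_count(q, r):
    """Number of ordered samples for a q:r partitioned training set.

    Step 1 chooses t of the q+1 places; step 2 spreads the r minority
    examples over those t places with no place left empty.
    """
    return sum(comb(q + 1, t) * comb(r - 1, t - 1) for t in range(1, r + 1))


def weighted_composite_skewness(n):
    """Average of q/n over equivocal q:(n-q) partitions, weighted by orderings."""
    parts = [(q, n - q) for q in range(1, n) if n / 2 < q < n]
    weights = [ordered_sample_count(q, r) for q, r in parts]
    total = sum(w * q / n for w, (q, r) in zip(weights, parts))
    return total / sum(weights)
```

For q = 3 and r = 2 the count is 10, and for training sets of size five the weighted average is (5 * 4/5 + 10 * 3/5) / 15 = 2/3, in agreement with the value .667 given for the weighted average.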

In  this  subsection  we  consider  "pure  skewness"  only.  We 
will  give  a detailed  analysis  for  the  "composite  skewness" 
case  in  the  next  subsection. 

Change $k_1$, $k_2$, $k_3$ and $k_4$ as required. For example,
if a training set $S_1$ consists of r of the (0,P) examples and
1 of the (1,N) example, then

$k_1 = P\{(0,P)\}^r P\{(1,N)\}^1$.

Using an analysis procedure similar to that given in Subsection
5.3.2, with the new $k_1$, $k_2$, $k_3$ and $k_4$ values, we
determine the following (see Table 5.1).

1) As we increase the simple skewness ratio, the relative
merit of $S_p$ over $S_u$ is increased in a log-like pattern.
In other words, the relative merit increases rapidly for
smaller values of the skewness ratio and increases slowly for
larger values.

2) However, the least upper bound value of the $S_p$
preferred region does not exceed 50%.

Table 5.1: Skewness vs. Size of $S_p$ preferred region

Size of         Skewness     Portion of
training set    of sample    $S_p$ region

3               .667         .352  (Schaffer's case)
4               .750         .414
5               .800         .445  ((4:1) cases only)
6               .833         .464  ((5:1) cases only)
11              .909         .490  ((10:1) cases only)
very large      ~.999        .500

Lemma 5.2 below characterizes the limiting value of the $S_p$
preferred region.


Lemma 5.2: The following are true.

1) If $p_0 < 0.5$ and $p_1 > 0.5$, then $S_u$ is preferred.

2) If $p_0 > 0.5$ and $p_1 < 0.5$, then $S_u$ is preferred.

3) If the skewness becomes close to 1, and if $p_0 > 0.5$
and $p_1 > 0.5$, then $S_p$ is preferred.

4) If the skewness becomes close to 1, and if $p_0 < 0.5$
and $p_1 < 0.5$, then $S_p$ is preferred.

Proof: 1) Let k be the number of examples represented by the
larger class and let K be the sum of the probabilities of
observing each training set defined in 5.3.1. Then

$$
\begin{aligned}
K \, (A(S_u) - A(S_p))
&= \left( (p_0/2)^k ((1-p_1)/2) + ((1-p_1)/2)^k (p_0/2) \right) (p_0 + 1 - p_1)/2 \\
&\quad + \left( ((1-p_0)/2)^k (p_1/2) + (p_1/2)^k ((1-p_0)/2) \right) (1 - p_0 + p_1)/2 \\
&\quad - \left( (p_0/2)^k ((1-p_1)/2) + (p_1/2)^k ((1-p_0)/2) \right) (p_0 + p_1)/2 \\
&\quad - \left( ((1-p_0)/2)^k (p_1/2) + ((1-p_1)/2)^k (p_0/2) \right) (1 - (p_0 + p_1)/2) \\
&= - ((2p_1 - 1)/2) \left( (p_0/2)^k ((1-p_1)/2) - ((1-p_0)/2)^k (p_1/2) \right) \\
&\quad - ((2p_0 - 1)/2) \left( (p_1/2)^k ((1-p_0)/2) - ((1-p_1)/2)^k (p_0/2) \right).
\end{aligned}
$$

Suppose $p_0 < 0.5$ and $p_1 > 0.5$. Then

$(p_0/2)^k ((1-p_1)/2) < ((1-p_0)/2)^k (p_1/2)$ and
$(p_1/2)^k ((1-p_0)/2) > ((1-p_1)/2)^k (p_0/2)$.

So, $A(S_u) - A(S_p) > 0$.

2)  Similar  reasoning  gives  this  result. 

3) Let $f_1(k) = (p_0/2)^k ((1-p_1)/2) - ((1-p_0)/2)^k (p_1/2)$ and
$f_2(k) = (p_1/2)^k ((1-p_0)/2) - ((1-p_1)/2)^k (p_0/2)$. Then

$f_1(k) > 0$ if $k > \ln((1-p_1)/p_1) / \ln((1-p_0)/p_0)$ and
$f_2(k) > 0$ if $k > \ln((1-p_0)/p_0) / \ln((1-p_1)/p_1)$.

Let $k^* = \max( \ln((1-p_1)/p_1) / \ln((1-p_0)/p_0), \;
\ln((1-p_0)/p_0) / \ln((1-p_1)/p_1) )$. Then for any
$p_0 > 0.5$ and $p_1 > 0.5$, we can find a k such that
$k > k^*$. So, for such a $k > k^*$, $A(S_u) - A(S_p) < 0$.
Hence, if the skewness becomes close to 1, then $S_p$ is
preferred.

4) Similar reasoning gives this result.

□ 
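
The threshold used in part 3) of the proof can be checked numerically. The Python sketch below is illustrative only: the function names and the sample point $(p_0, p_1) = (0.6, 0.9)$ are assumptions made here, not values taken from the text. It evaluates $f_1$, $f_2$, the bound $k^*$, and the quantity $K (A(S_p) - A(S_u))$ for k:1 partitioned training sets.

```python
from math import log


def f1(k, p0, p1):
    return (p0 / 2) ** k * ((1 - p1) / 2) - ((1 - p0) / 2) ** k * (p1 / 2)


def f2(k, p0, p1):
    return (p1 / 2) ** k * ((1 - p0) / 2) - ((1 - p1) / 2) ** k * (p0 / 2)


def k_star(p0, p1):
    """Bound k* from the proof; requires 0.5 < p0 < 1 and 0.5 < p1 < 1."""
    a = log((1 - p1) / p1) / log((1 - p0) / p0)
    return max(a, 1 / a)


def gap(k, p0, p1):
    """K * (A(S_p) - A(S_u)) for k:1 partitioned training sets."""
    return ((2 * p1 - 1) / 2) * f1(k, p0, p1) + ((2 * p0 - 1) / 2) * f2(k, p0, p1)
```

At the sample point, $k^*$ lies between 5 and 6, $f_1$ changes sign from negative at k = 5 to positive at k = 6, and `gap(6, 0.6, 0.9)` is positive, so $S_p$ is preferred once k exceeds $k^*$. The bound is sufficient rather than necessary: the gap can already be positive for smaller k.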


5.4.2  The Effect of a Larger Training Set and Noise

In  5.4.1  we  increased  the  skewness  of  the  training  set. 
In  order  to  give  a comprehensive  view  on  the  effect  of  pruning 
on  a larger  training  set,  we  consider  both  types  of  sampling 
strategies,  i.e.,  unordered  sampling  and  ordered  sampling. 
These  correspond  to  simple  average  composite  skewness  and 
weighted  average  composite  skewness  discussed  in  the  previous 
subsection,  respectively. 

5.4.2.1  Simple average composite skewness

First we consider the unordered sampling case. In this
case, when we increase the size of the training set, several
different partitioning schemes are possible. For example, if
the training set size is six, the following cases are
possible.

1) 6:0 case: all six examples are positive examples, or
all six examples are negative examples.

2) 5:1 case: five examples are in the same class
(positive or negative) and the remaining one example is in the
other class.

3) 4:2 case: four examples are in the same class
(positive or negative) and the remaining two examples are in
the other class.

4) 3:3 case: three examples are in the positive class and
three examples are in the negative class.

For case 1), the sample strongly supports a simple
hypothesis. For case 4), the given sample strongly supports
a complex hypothesis since no simple hypothesis can
explain more than 50% of the sample. Cases 2) and 3) are
equivocal cases. We analyze the average performance of each
hypothesis for these cases.

We use an analysis procedure similar to that used in
Subsection 5.3.2. The number of equivocal training sets, the
probabilities of observing each training set (i.e., $k_1$,
$k_2$, etc.), and the calculation of expected accuracy (i.e.,
$A(S_p)$ and $A(S_u)$) are modified appropriately. Since the
ways of partitioning the sample differ depending on the sample
size, the calculation formulae are not uniform for different
training set sizes. However, since they have a similar form
and logic, we explain the procedure by giving an example for a
sample of size six.

If we ignore the order of sampling, there are just eight
equivocal training sets of size six. They are

$S_1$ = { (0,P), (0,P), (0,P), (0,P), (0,P), (1,N) },
$S_2$ = { (0,N), (0,N), (0,N), (0,N), (0,N), (1,P) },
$S_3$ = { (0,P), (1,N), (1,N), (1,N), (1,N), (1,N) },
$S_4$ = { (0,N), (1,P), (1,P), (1,P), (1,P), (1,P) },
$S_5$ = { (0,P), (0,P), (0,P), (0,P), (1,N), (1,N) },
$S_6$ = { (0,N), (0,N), (0,N), (0,N), (1,P), (1,P) },
$S_7$ = { (0,P), (0,P), (1,N), (1,N), (1,N), (1,N) }, and
$S_8$ = { (0,N), (0,N), (1,P), (1,P), (1,P), (1,P) }.

Then the probabilities of observing the training sets are,
respectively,

$k_1 = P\{(0,P)\}^5 P\{(1,N)\}^1$,
$k_2 = P\{(0,N)\}^5 P\{(1,P)\}^1$,
$k_3 = P\{(1,N)\}^5 P\{(0,P)\}^1$,
$k_4 = P\{(1,P)\}^5 P\{(0,N)\}^1$,
$k_5 = P\{(0,P)\}^4 P\{(1,N)\}^2$,
$k_6 = P\{(0,N)\}^4 P\{(1,P)\}^2$,
$k_7 = P\{(1,N)\}^4 P\{(0,P)\}^2$, and
$k_8 = P\{(1,P)\}^4 P\{(0,N)\}^2$.

$P\{(0,P)\}$, $P\{(0,N)\}$, $P\{(1,P)\}$ and $P\{(1,N)\}$ are
defined in the same way as in Subsection 5.3.3.

Then the probabilities of each sample over the space of
all equivocal training sets are

$P(S_i) = k_i / K$ for i = 1,2,3,4,5,6,7,8,

where $K = k_1 + k_2 + k_3 + k_4 + k_5 + k_6 + k_7 + k_8$. The
prediction accuracy of tree #i, $A(T_i)$, where i = 1,2,3,4, is
defined as in Subsection 5.3.1. Since $S_u$ chooses $T_1$ given
$S_1$ or $S_3$ or $S_5$ or $S_7$ and $T_2$ given $S_2$ or $S_4$
or $S_6$ or $S_8$,

$A(S_u) = (P(S_1) + P(S_3) + P(S_5) + P(S_7)) A(T_1) + (P(S_2) + P(S_4) + P(S_6) + P(S_8)) A(T_2)$.

Similarly, since $S_p$ chooses $T_3$ given $S_1$ or $S_4$ or
$S_5$ or $S_8$ and $T_4$ given $S_2$ or $S_3$ or $S_6$ or $S_7$,

$A(S_p) = (P(S_1) + P(S_4) + P(S_5) + P(S_8)) A(T_3) + (P(S_2) + P(S_3) + P(S_6) + P(S_7)) A(T_4)$.

By performing calculations for all possible pairs of $p_0$ and
$p_1$ values, we get the results shown in Table 5.2.
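
The procedure for unordered samples generalizes from size six to any sample size n. The following Python sketch is one possible noise-free implementation under stated assumptions: the function names are hypothetical, each unordered equivocal training set is weighted by the product of its observation probabilities as in the $k_i$ above, and the region size is estimated at cell midpoints of a regular grid, which is a choice made here for illustration.

```python
def accuracy_gap(n, p0, p1):
    """K * (A(S_p) - A(S_u)) over all unordered equivocal sets of size n."""
    obs = {(0, "P"): p0 / 2, (0, "N"): (1 - p0) / 2,
           (1, "P"): p1 / 2, (1, "N"): (1 - p1) / 2}
    acc = {"T1": (p0 + 1 - p1) / 2, "T2": (1 - p0 + p1) / 2,
           "T3": (p0 + p1) / 2, "T4": 1 - (p0 + p1) / 2}
    # (majority observation, minority observation, tree of S_u, tree of S_p)
    patterns = [((0, "P"), (1, "N"), "T1", "T3"),
                ((0, "N"), (1, "P"), "T2", "T4"),
                ((1, "N"), (0, "P"), "T1", "T4"),
                ((1, "P"), (0, "N"), "T2", "T3")]
    gap = 0.0
    for q in range(n // 2 + 1, n):              # equivocal: n/2 < q < n
        for major, minor, t_u, t_p in patterns:
            k = obs[major] ** q * obs[minor] ** (n - q)
            gap += k * (acc[t_p] - acc[t_u])
    return gap


def sp_region(n, grid=100):
    """Fraction of grid-cell midpoints where S_p is preferable."""
    wins = sum(1 for i in range(grid) for j in range(grid)
               if accuracy_gap(n, (i + 0.5) / grid, (j + 0.5) / grid) > 0)
    return wins / grid ** 2
```

For n = 3 the function reduces to the four training sets of Subsection 5.3.1, and for n = 6 it covers the eight training sets $S_1$ to $S_8$ listed above.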

As we increase the size of the training sets, the
skewness fluctuates between 0.700 and 0.750. We can calculate
the asymptotic value of the skewness of the training set as
follows. First consider an even-numbered sample size. Let 2n
be the size of the sample, where n is a whole number. Then

skewness = 1/(n-1) * ((n+1)/2n + (n+2)/2n + ... + (2n-1)/2n)
         = 1/(n-1) * (3/4)(n-1) = 0.750.

Now consider a sample set of size 2n-1. Then

skewness = 1/(n-1) * (n/(2n-1) + (n+1)/(2n-1) + ... + (2n-2)/(2n-1))
         = 1/(n-1) * (n-1)(3n-2)/(4n-2)
         = (3n-2)/(4n-2).

By taking the limit as n goes to infinity, we get the
asymptotic value of skewness equal to 0.750.
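
The two closed forms can be checked with exact rational arithmetic. This is a small illustrative sketch; the function name `simple_average_skewness` is hypothetical.

```python
from fractions import Fraction


def simple_average_skewness(size):
    """Simple average of q/size over all equivocal q:(size-q) partitions."""
    qs = [q for q in range(1, size) if 2 * q > size]   # n/2 < q < n
    return sum(Fraction(q, size) for q in qs) / len(qs)
```

Every even sample size from four upward gives exactly 3/4, and an odd size 2n-1 gives (3n-2)/(4n-2), for example 7/10 for size five and 13/18 for size nine.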


Table 5.2: Sample size vs. $S_p$ region for unordered sample

Size of         Skewness     Portion of         $e_d^*$ (**)
training set    of set       $S_p$ region (*)

3               .667         .352               .34
4               .750         .414               .26
5               .700         .401               .30
6               .750         .436               .26
7               .714         .431               .28
8               .750         .455               .25
9               .722         .450               .27
10              .750         .465               .25
very large      ~.750        .500               .00

*: The $S_p$ preferred region has been calculated approximately
by taking 10,000 points in the $(p_0, p_1)$ plane.

**: $e_d^*$ is the amount of description noise where the $S_p$
preferred region is approximately equal to the $S_u$ preferred
region. Here we set $e_c = 0$.

As we see in Table 5.2, the effect of a larger training set on
the size of the $S_p$ region is not monotone. As we increase
the size of a training set from four samples to five, the
relative merit of the pruned decision tree decreases. This is
opposite to the case of increasing the sample size from three
to four, where the relative merit of the pruned decision tree
increases. However, we can easily see that for odd or even
numbered sequences (e.g., the 3,5,7,9 odd sequence or the
4,6,8,10 even sequence), the $S_p$ preferred regions are
monotonically increasing in size. So we can say that the
effect of a larger training set is not monotone, and the
relative merit is influenced by the skewness of the training
set as well as the size itself. That is, for the same sample
size, higher skewness gives a higher percentage of the $S_p$
preferred region. If both the sample size and skewness
increase, then the $S_p$ preferred region increases.

These  observations  are  generalized  in  the  following. 

Lemma 5.3: Let n be an odd number such that n > 3. If the
sample size increases from n to n+1, then the $S_p$ preferred
region does not decrease.

Proof: For any q:r partitioning of n, a (q+1):r partitioning
of n+1 exists. Let $f_n$ be the difference of the prediction
accuracy between $S_p$ and $S_u$ for the sample of size n.
Suppose $p_0 > 0.5$ and $p_1 > 0.5$. Then $f_{n+1}$ is obtained
by multiplying the positive terms of $f_n$ by $p_0/2$ or
$p_1/2$ and the negative terms of $f_n$ by $(1-p_0)/2$ or
$(1-p_1)/2$. So, if $f_n$ is positive, then $f_{n+1}$ is also
positive since $p_0/2 > (1-p_0)/2$ and $p_1/2 > (1-p_1)/2$, and
the sum of the multipliers is equal to 0.5 in both cases.
Hence, if $A(S_p) > A(S_u)$ for n, then this inequality also
holds for n+1. This implies that the $S_p$ preferred region
does not decrease. For the case of $p_0 < 0.5$ and
$p_1 < 0.5$, by the formula of $f_n$ given in the proof of
Lemma 5.2, we multiply the positive terms by a correspondingly
larger number and the negative terms by a smaller number.
Hence, if $A(S_p) > A(S_u)$ for n, then this inequality also
holds for n+1.

□ 


We have also investigated the influence of description
noise and classification noise. We have not seen any notable
evidence that the effect of classification noise is influenced
by a change in the size of the training sets. However, the
effect of description noise becomes stronger as the size of
the training set increases. That is, the amount of
description noise at which the $S_p$ preferred region is
approximately equal to the $S_u$ preferred region is reduced as
the size of the training set increases. The range of
description noise for which $S_p$ has relative merit over $S_u$
is extended to relatively lower noise levels. We also see that
the relative importance of description noise to classification
noise does not change as we change the skewness and the size
of the training sets.

These  observations  are  generalized  below. 

Lemma 5.4: For the unordered sampling case, the limiting value
for the size of the $S_p$ preferred region is .500 as the
sample size goes to infinity.


Proof: Let n be an even-numbered sample size. Then the
difference of expected prediction accuracy, $f_n$, is

$$
\begin{aligned}
f_n &= K \, (A(S_p) - A(S_u)) \\
&= ((2p_1-1)/2) \left( (p_0/2)^{n-1} ((1-p_1)/2) - ((1-p_0)/2)^{n-1} (p_1/2) \right) \\
&\quad + ((2p_0-1)/2) \left( (p_1/2)^{n-1} ((1-p_0)/2) - ((1-p_1)/2)^{n-1} (p_0/2) \right) \\
&\quad + ((2p_1-1)/2) \left( (p_0/2)^{n-2} ((1-p_1)/2)^2 - ((1-p_0)/2)^{n-2} (p_1/2)^2 \right) \\
&\quad + ((2p_0-1)/2) \left( (p_1/2)^{n-2} ((1-p_0)/2)^2 - ((1-p_1)/2)^{n-2} (p_0/2)^2 \right) \\
&\quad + \cdots \\
&\quad + ((2p_1-1)/2) \left( (p_0/2)^{n/2+k} ((1-p_1)/2)^{n/2-k} - ((1-p_0)/2)^{n/2+k} (p_1/2)^{n/2-k} \right) \\
&\quad + ((2p_0-1)/2) \left( (p_1/2)^{n/2+k} ((1-p_0)/2)^{n/2-k} - ((1-p_1)/2)^{n/2+k} (p_0/2)^{n/2-k} \right) \\
&\quad + \cdots \\
&\quad + ((2p_1-1)/2) \left( (p_0/2)^{n/2+1} ((1-p_1)/2)^{n/2-1} - ((1-p_0)/2)^{n/2+1} (p_1/2)^{n/2-1} \right) \\
&\quad + ((2p_0-1)/2) \left( (p_1/2)^{n/2+1} ((1-p_0)/2)^{n/2-1} - ((1-p_1)/2)^{n/2+1} (p_0/2)^{n/2-1} \right).
\end{aligned}
$$

For an odd-numbered sample size n, the terms n/2+1 and n/2-1
in the above would be changed to (n+1)/2 and (n-1)/2,
respectively. We focus on the case where n is even.

Let

$$
\begin{aligned}
g(k) &= ((2p_1-1)/2) \left( (p_0/2)^{n/2+k} ((1-p_1)/2)^{n/2-k} - ((1-p_0)/2)^{n/2+k} (p_1/2)^{n/2-k} \right) \\
&\quad + ((2p_0-1)/2) \left( (p_1/2)^{n/2+k} ((1-p_0)/2)^{n/2-k} - ((1-p_1)/2)^{n/2+k} (p_0/2)^{n/2-k} \right),
\end{aligned}
$$

where k = 1,2,...,n/2-1.

By rearranging the terms we get

$$
\begin{aligned}
g(k) &= (p_1/2)^{n/2-k} ((1-p_0)/2)^{n/2-k}
   \left( ((2p_0-1)/2) (p_1/2)^{2k} - ((2p_1-1)/2) ((1-p_0)/2)^{2k} \right) \\
&\quad + (p_0/2)^{n/2-k} ((1-p_1)/2)^{n/2-k}
   \left( ((2p_1-1)/2) (p_0/2)^{2k} - ((2p_0-1)/2) ((1-p_1)/2)^{2k} \right).
\end{aligned}
$$


Suppose $p_0 > 0.5$ and $p_1 > 0.5$. Without loss of
generality, further suppose $p_1 > p_0$. Then, for any k,

$((2p_1-1)/2) (p_0/2)^{2k} - ((2p_0-1)/2) ((1-p_1)/2)^{2k} > 0$.

If $k > \ln((2p_1-1)/(2p_0-1)) / (2 \ln(p_1/(1-p_0)))$, then

$((2p_0-1)/2) (p_1/2)^{2k} - ((2p_1-1)/2) ((1-p_0)/2)^{2k} > 0$.

Let $k^*$ be the smallest integer not smaller than
$\ln((2p_1-1)/(2p_0-1)) / (2 \ln(p_1/(1-p_0)))$.

Then $g(k) > 0$ for all $k > k^*$.

Since $g(k+1)$ is obtained by multiplying the positive terms of
$g(k)$ by $p_1/(1-p_0)$ or $p_0/(1-p_1)$, and by multiplying the
negative terms by the inverses of these factors (which are less
than 1), $g(k+1) > g(k)$ for all k.

Note that the following algebraic fact holds. If A - B = 0 and
a > 1, then aA - (1/a)B = -((1/a)A - aB). Since $g(k^*) > 0$,
by the above fact, $g(k^*+1) + g(k^*-1) > 0$.

Hence, $g(k^*+t) + g(k^*-t) > 0$ for all $t < k^*$.

Since $k^*$ is finite for a fixed $(p_0, p_1)$ point, $f_n$ can
be made positive by choosing n sufficiently large. That is, for
sufficiently large n,

$A(S_p) - A(S_u) > 0$ if $p_0 > 0.5$ and $p_1 > 0.5$.

Similarly, $A(S_p) - A(S_u) > 0$ if $p_0 < 0.5$ and $p_1 < 0.5$.
When $p_0 < 0.5$ and $p_1 > 0.5$, or when $p_0 > 0.5$ and
$p_1 < 0.5$, $A(S_p) - A(S_u) < 0$ by the same reasoning given
in the proof of Lemma 5.2. Hence the limiting value for the
size of the $S_p$ preferred region is .500.

□ 


Corollary 5.5:  Let n be a sufficiently large number.  If the
sample size increases from n to n+2, then the Sp preferred
region does not decrease.

Proof:  Let n be the sample size.  Without loss of generality
we assume n is an even number.  Then the difference of
expected prediction accuracy, f_n, is defined as in the proof
of Lemma 5.4.

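Define g(k) and k* the same way as in the proof of Lemma 5.4,
where k = 1, 2, ..., n/2-1.

Let g(k) = g_1(k) + g_2(k), where

$$ g_1(k) = \left(\frac{p_1}{2}\right)^{n/2-k}\left(\frac{1-p_0}{2}\right)^{n/2-k}
\left[\frac{2p_0-1}{2}\left(\frac{p_1}{2}\right)^{2k} - \frac{2p_1-1}{2}\left(\frac{1-p_0}{2}\right)^{2k}\right], $$

and

$$ g_2(k) = \left(\frac{p_0}{2}\right)^{n/2-k}\left(\frac{1-p_1}{2}\right)^{n/2-k}
\left[\frac{2p_1-1}{2}\left(\frac{p_0}{2}\right)^{2k} - \frac{2p_0-1}{2}\left(\frac{1-p_1}{2}\right)^{2k}\right]. $$

Suppose p_0 > 0.5 and p_1 > 0.5.  Without loss of generality,
further suppose p_1 > p_0.  Then, for any k,

$$ \frac{2p_1-1}{2}\left(\frac{p_0}{2}\right)^{2k} - \frac{2p_0-1}{2}\left(\frac{1-p_1}{2}\right)^{2k} > 0. $$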

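Suppose f_n > 0.  f_n can be written as follows.

f_n = g(1) + g(2) + ... + g(k) + ... + g(n/2-1).

Write f_{n+2} as follows.

f_{n+2} = h(1) + h(2) + ... + h(k) + ... + h(n/2),

where

$$ h(k) = \left(\frac{p_1}{2}\right)^{(n+2)/2-k}\left(\frac{1-p_0}{2}\right)^{(n+2)/2-k}
\left[\frac{2p_0-1}{2}\left(\frac{p_1}{2}\right)^{2k} - \frac{2p_1-1}{2}\left(\frac{1-p_0}{2}\right)^{2k}\right]
+ \left(\frac{p_0}{2}\right)^{(n+2)/2-k}\left(\frac{1-p_1}{2}\right)^{(n+2)/2-k}
\left[\frac{2p_1-1}{2}\left(\frac{p_0}{2}\right)^{2k} - \frac{2p_0-1}{2}\left(\frac{1-p_1}{2}\right)^{2k}\right]. $$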


Then

$$ f_{n+2} = h(n/2) + \frac{p_1}{2}\cdot\frac{1-p_0}{2}\left(g_1(1) + \dots + g_1(n/2-1)\right)
+ \frac{p_0}{2}\cdot\frac{1-p_1}{2}\left(g_2(1) + \dots + g_2(n/2-1)\right). $$

For sufficiently large n, if f_n > 0,
then g_1(1) + ... + g_1(n/2-1) > 0 must hold.

Since h(n/2) > 0, and all multipliers in the above equation are
positive, f_{n+2} > 0 follows.

For p_0 < 0.5 and p_1 < 0.5, f_{n+2} > 0 by a similar reasoning.

□ 
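The decomposition of f_{n+2} used in this proof can be checked numerically. This is a minimal sketch under stated assumptions: the function names are hypothetical, n is even, f_n is taken to be the sum g(1) + ... + g(n/2-1) as written in the proof, and h(k) is g(k) evaluated at sample size n+2.

```python
import math


def g1(k, n, p0, p1):
    """First part of g(k), with the (p1, 1-p0) prefactor."""
    m = n // 2  # n is assumed to be even
    return ((p1 / 2) * ((1 - p0) / 2)) ** (m - k) * (
        ((2 * p0 - 1) / 2) * (p1 / 2) ** (2 * k)
        - ((2 * p1 - 1) / 2) * ((1 - p0) / 2) ** (2 * k)
    )


def g2(k, n, p0, p1):
    """Second part of g(k), with the (p0, 1-p1) prefactor."""
    m = n // 2
    return ((p0 / 2) * ((1 - p1) / 2)) ** (m - k) * (
        ((2 * p1 - 1) / 2) * (p0 / 2) ** (2 * k)
        - ((2 * p0 - 1) / 2) * ((1 - p1) / 2) ** (2 * k)
    )


def f(n, p0, p1):
    """f_n as the sum of g(k) for k = 1, ..., n/2 - 1."""
    return sum(g1(k, n, p0, p1) + g2(k, n, p0, p1)
               for k in range(1, n // 2))


def f_next_decomposed(n, p0, p1):
    """f_{n+2} rebuilt from h(n/2) and the scaled partial sums of g1, g2."""
    h_last = g1(n // 2, n + 2, p0, p1) + g2(n // 2, n + 2, p0, p1)
    s1 = sum(g1(k, n, p0, p1) for k in range(1, n // 2))
    s2 = sum(g2(k, n, p0, p1) for k in range(1, n // 2))
    return (h_last
            + (p1 / 2) * ((1 - p0) / 2) * s1
            + (p0 / 2) * ((1 - p1) / 2) * s2)


# Example: compare the direct sum with the decomposition at one point.
direct = f(12, 0.6, 0.8)
rebuilt = f_next_decomposed(10, 0.6, 0.8)
same = math.isclose(direct, rebuilt, rel_tol=1e-9)
```

At the sample point (p_0, p_1) = (0.6, 0.8) every bracketed term is positive, so both f_10 and f_12 are positive there; other points may need larger n, as the corollary assumes.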

5.4.2.2 Weighted average skewness

Here we consider the number of permutations of each
training set, and weight the probabilities, k_i, and the
skewnesses by those numbers.  The analysis procedure is
similar to the "simple average" case.  However, each k_i is
multiplied by the number of permutations.  By the formula
given in Subsection 5.4.1, we get the number of permutations.
For example, for a training set of size six, the numbers of
permutations are obtained as follows.  For a 5:1 partitioned
training set (i.e., five examples are in one class and one
example is in the other class), the number of permutations is
$C^6_1 = 6$.  And for a 4:2 partitioned training set, the number
of permutations is $C^5_1 + C^5_2 = 15$.
So, we get:

k_1 = 6 * P{(0,P)}^5 P{(1,N)}^1,

k_2 = 6 * P{(0,N)}^5 P{(1,P)}^1,

k_3 = 6 * P{(1,N)}^5 P{(0,P)}^1,

k_4 = 6 * P{(1,P)}^5 P{(0,N)}^1,

k_5 = 15 * P{(0,P)}^4 P{(1,N)}^2,

k_6 = 15 * P{(0,N)}^4 P{(1,P)}^2,

k_7 = 15 * P{(1,N)}^4 P{(0,P)}^2, and

k_8 = 15 * P{(1,P)}^4 P{(0,N)}^2.

Then the skewness is

6/21 * 5/6 + 15/21 * 4/6 = .714.
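The weighted average can be reproduced for other sample sizes. The sketch below is illustrative only and rests on two assumptions not spelled out at this point in the text: the number of permutations of a q:r partition equals the binomial coefficient C(n, r) with n = q + r (which matches the counts 6 and 15 above), and only partitions with q > r >= 1 are included (as in the size-six example, where 6:0 and 3:3 do not appear). The function name is hypothetical.

```python
from math import comb


def weighted_average_skewness(n):
    """Average of q/n over q:r partitions (q > r >= 1), weighted by C(n, r)."""
    total_weight = 0
    weighted_sum = 0.0
    for r in range(1, (n - 1) // 2 + 1):  # minority-class count, r < n/2
        q = n - r                         # majority-class count
        weight = comb(n, r)               # assumed number of permutations
        total_weight += weight
        weighted_sum += weight * q / n    # skewness of a q:r partition is q/n
    return weighted_sum / total_weight


# Size six: (6 * 5/6 + 15 * 4/6) / 21
skew_6 = weighted_average_skewness(6)
```

Under these assumptions the function gives 15/21 for n = 6, in agreement with the .714 computed above.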

By performing calculations for all possible pairs of p_0 and p_1
values, we get the results shown in Table 5.3.


Table 5.3:  Sample size vs. Sp region for ordered sample

Size of         Skewness    Portion of       ed* (**)
training set    of set      Sp region (*)

3               .667        .352             .34
4               .750        .414             .26
5               .667        .380             .33
6               .714        .422             .29
7               .651        .394             .35
8               .685        .428             .31
9               .635        .401             .36
10              .662        .429             .33
11              .623        .411             .37
12              .645        .434             .35
very large                  .500             .00

*:  The Sp preferred region has been calculated approximately by
taking 10,000 points in the (p_0, p_1) plane.

**:  ed* is the amount of description noise where the Sp
preferred region is approximately equal to the Su preferred
region.  Here we set ec = 0.


In Table 5.3 we see the same effect of a larger training
set as found in Table 5.2.  That is, within the even-numbered
sample sizes and within the odd-numbered sample sizes, the Sp
preferred regions are monotonically increasing.  However, the
increase is relatively slow compared to the former "simple
average skewness" case.

Even  though  the  Sp  preferred  regions  are  monotonically 
increasing,  the  corresponding  ed*  is  increasing.  This  is 
opposite  to  the  case  of  "simple  average  skewness".  This  can 
be  explained  by  the  declining  skewness  since  ed*  decreases  as 
the  skewness  increases,  as  shown  in  Tables  5.1  and  5.2. 


Lemma 5.6:  Let (q:r) and (q+1:r-1) be two partitions of a
training sample.  Then the number of permutations of the
(q+1:r-1) partitioned training sample is r/(q+1) times the
number of permutations of the (q:r) partitioned training sample.


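Proof:  Note that the number of permutations for a q:r
partitioned training sample is

$$ \sum_{c=1}^{r} C^{q+1}_{c}\, C^{r-1}_{c-1}. $$

Equate the numbers of permutations for the (q+1:r-1) and q:r
partitioned training samples by using an unknown multiple x.

$$ \sum_{c=1}^{r-1} C^{q+2}_{c}\, C^{r-2}_{c-1} = x \sum_{c=1}^{r} C^{q+1}_{c}\, C^{r-1}_{c-1}. $$

We can rewrite the above equation as follows.

$$ \sum_{c=0}^{r-2} C^{r-2}_{c}\, C^{q+2}_{c+1} = x \sum_{c=0}^{r-1} C^{r-1}_{c}\, C^{q+1}_{c+1}. $$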

By the identity 3.20 of Gould (1972), which is

$$ \sum_{k=0}^{n} C^{n}_{k}\, C^{x}_{k+r} = C^{n+x}_{n+r} $$

(where n, x and r denote the identity's own variables),

the above equation reduces to

$$ C^{q+r}_{r-1} = x\, C^{q+r}_{r}. $$

Hence, x = r/(q+1) follows.

□ 
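Lemma 5.6 can be spot-checked by direct counting. The sketch below assumes that the number of permutations of a q:r partition is the sum over c = 1, ..., r of C(q+1, c) * C(r-1, c-1); this is one reading of the sum in the proof, chosen because it reproduces the counts 6 and 15 quoted for the 5:1 and 4:2 partitions. The function name is hypothetical.

```python
from math import comb


def permutations_count(q, r):
    """Assumed count for a q:r partition: sum of C(q+1, c) * C(r-1, c-1)."""
    return sum(comb(q + 1, c) * comb(r - 1, c - 1) for c in range(1, r + 1))


def lemma_5_6_holds(q, r):
    """Check count(q+1 : r-1) == r/(q+1) * count(q : r) in integer arithmetic."""
    return permutations_count(q + 1, r - 1) * (q + 1) == r * permutations_count(q, r)


# The two counts quoted for a training set of size six.
count_5_1 = permutations_count(5, 1)
count_4_2 = permutations_count(4, 2)
```

By the Vandermonde convolution this sum equals C(q+r, r), so the ratio in the lemma is C(q+r, r-1) / C(q+r, r) = r/(q+1). The check is only meaningful for r >= 2, since the sum is empty for a partition with r = 0.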


Lemma  5.7:  For  the  ordered  sampling  case,  the  limiting  value 
for  the  size  of  the  Sp  preferred  region  is  .500  as  the  sample 
size  n goes  to  infinity. 

Proof:  The ratio of the number of permutations for k+1 (i.e.,
a partition where the difference between q and r is 2(k+1)) to
the number of permutations for k is

(n-2k) / (n+2k+2) for even n, or
(n-2k+1) / (n+2k+1) for odd n,

by Lemma 5.6.  This ratio increases with n for any fixed k.
Since k* is finite by the proof of Lemma 5.4, the sum of the
numbers of permutations for all k < k* can be exceeded by the
sum of the numbers of permutations for all k > k* for
sufficiently large n.  Since the expected prediction
accuracies are multiplied by the number of permutations, the
reasoning in the proof of Lemma 5.4 holds for this case also.
□ 
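For the even case, the ratio and its monotonicity in n can be written out as a short worked derivation. It assumes the indexing q = n/2 + k and r = n/2 - k, so that q - r = 2k; this indexing is inferred from the stated ratio rather than given explicitly in the proof.

```latex
\[
\frac{r}{q+1}
  = \frac{n/2 - k}{n/2 + k + 1}
  = \frac{n - 2k}{n + 2k + 2},
\qquad
\frac{\partial}{\partial n}\,\frac{n - 2k}{n + 2k + 2}
  = \frac{(n + 2k + 2) - (n - 2k)}{(n + 2k + 2)^{2}}
  = \frac{4k + 2}{(n + 2k + 2)^{2}} > 0 .
\]
```

So for fixed k the ratio grows with n and tends to 1, which is why the counts for large k are not negligible once n is large.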

5.4.3  Discussion 

Ideally,  pruning  should  lead  to  concept  simplification 
with  better  predictions  or  at  least  a small  loss  of  prediction 
accuracy.  In  this  chapter  we  have  developed  conditions  under 
which  pruning  is  necessary  to  obtain  better  prediction 
accuracy.  We  have  analyzed  this  problem  in  three  directions. 

1)  Increasing  skewness. 

2)  Increasing  sample  size. 


3)  Increasing  noise  levels. 

All three have a positive effect on the relative merit of
pruning.  In particular, for higher skewness and/or larger
training sets, the Sp preferred region approaches 50% when
there is no noise of any kind.  That is, a pruned tree has
almost the same level of prediction accuracy as the
unpruned tree.  The situation becomes more favorable to the
pruned tree when description noise exists.

Now  we  give  our  main  result  in  Theorem  5.8. 

Theorem  5.8:  Under  the  assumption  of  a uniform  distribution, 
as  the  skewness  increases  and/or  sample  size  increases,  the 
merit  of  pruning  increases,  and  the  limiting  value  for  the 
size  of  the  Sp  preferred  region  is  50%. 

We also considered the degree to which prediction
accuracies for the two strategies differ at a given (p_0, p_1)
point.  The amount of difference in quality of these
strategies depends on the distributions for p_0 and p_1.  If we
assume a uniform distribution, and take 10,000 points at equal
intervals from the (p_0, p_1) unit square, we get the average of
the differences of the prediction accuracies, shown in Table
5.4.  The results are based on the "weighted average" cases
described in Subsection 5.4.2.2.


Table 5.4:  Average difference of prediction accuracy

Sample size     Skewness    Sp region    Difference (*)

3               .667        .353         .073
4               .750        .414         .041
10              .662        .429         .030
11              .623        .411         .037
12              .645        .434         .028

*:  This is the average difference of the prediction accuracy
between Sp and Su for a fixed (p_0, p_1) point.
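The grid evaluation behind the "Sp region" columns can be sketched as follows. This is an illustration under stated assumptions, not the procedure actually used: the 10,000 points are taken as cell midpoints of a 100 x 100 grid, the function names are hypothetical, and `limiting_diff` is a stand-in whose sign matches the limiting case (positive exactly when p_0 and p_1 lie on the same side of 0.5) rather than the real accuracy difference A(Sp) - A(Su).

```python
def preferred_fraction(diff, grid=100):
    """Fraction of grid points in the unit square where diff(p0, p1) > 0."""
    hits = 0
    for i in range(grid):
        p0 = (i + 0.5) / grid          # midpoint of the i-th cell
        for j in range(grid):
            p1 = (j + 0.5) / grid      # midpoint of the j-th cell
            if diff(p0, p1) > 0:
                hits += 1
    return hits / (grid * grid)


def limiting_diff(p0, p1):
    """Hypothetical stand-in with the sign pattern of the limiting case."""
    return (p0 - 0.5) * (p1 - 0.5)


limit_region = preferred_fraction(limiting_diff)
```

With the stand-in, half of the grid points are counted, which corresponds to the limiting value of .500 for the Sp preferred region. Replacing `limiting_diff` with an actual accuracy difference would give the finite-sample portions.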

As  we  see  in  Table  5.4,  the  loss  of  prediction  accuracy 
for  the  pruned  tree  is  less  than  4%  for  sample  sizes  greater 
than  ten.  This  small  amount  of  loss  in  prediction  accuracy  as 
a result  of  pruning  is  often  an  acceptable  trade-off  for 
producing  a simple  concept. 


CHAPTER  6 

SUMMARY  AND  FUTURE  RESEARCH 


Empirical  results  have  shown  that  pruning  can  improve  the 
accuracy  of  an  induced  decision  tree.  Pruning  also  leads  to 
concise  rules.  In  Chapter  3 we  provide  a pruning 
algorithm  based  on  the  rank  of  a decision  tree.  A bound  on 
the  error  due  to  pruning  by  the  rank  of  a decision  tree  is 
determined  under  the  assumptions  of  an  equally  likely 
distribution  over  the  instance  space  and  a deterministic  tree 
labelling  rule.  This  bound  is  then  used  with  recent  results 
in  learning  theory  to  determine  a sample  size  sufficient  for 
PAC  identification  of  decision  trees  with  pruning.  We  also 
discuss  other  pruning  rules  and  their  effects  on  the  error  due 
to pruning.  With a nondeterministic tree labelling rule we
show that the upper bound of the average pruning error is less
than or equal to 0.5 under an equally likely distribution.

In  Chapter  3 we  provide  a bound  on  the  training  sample 
size  to  guarantee  PAC  learning  of  a decision  tree  with 
pruning.  In  a realistic  learning  environment  it  is  often  not 
possible  to  obtain  a large  enough  sample.  For  those  cases,  we 
provide  several  methods  for  a posterior  evaluation  of  the 
accuracy of a pruned decision tree in Chapter 4.  We give a
method which estimates a lower bound for the worst possible
confidence factor, δ, by using a Beta prior.  Also, we give a
more detailed view of the meaning of this lower bound, and
suggest a way to improve this lower bound.

In  Chapter  5 we  develop  conditions  under  which  pruning  is 
necessary  for  better  prediction  accuracy  as  well  as  for 
concept  simplification.  We  give  an  analysis  of  the  reason  why 
pruning  is  necessary  in  realistic  learning  situations. 

We  generalize  Schaffer's  (1991)  results  for  larger 
training  sets.  A Bayesian  analysis  shows  that  the  average 
prediction  accuracy  of  the  pruned  tree  increases,  and  the 
effect  of  description  noise  becomes  stronger  as  the  size  of 
the  training  set  increases.  For  very  large  training  sets,  the 
pruned  tree  has  the  prediction  accuracy  equal  to  that  of  the 
unpruned  tree. 

Future  work  will  be  needed  to  determine  the  pruning  error 
under  more  general  assumptions  on  the  distribution  over  the 
instance  space.  Also,  Theorem  3.23  can  be  tightened  if  an  a 
priori  estimate,  k,  of  the  rank  of  the  induced  decision  tree 
can  be  determined.  If  so,  jun  r can  be  replaced  by  Mr  r* 

Here  we  take  the  rank  of  a tree  as  a conciseness  measure 
of  a decision  tree.  Future  work  will  be  needed  to  assess  the 
effect  of  pruning  under  other  conciseness  criteria. 

Future work will be needed to find a more direct and
convenient way to find δ, in particular, a way to find an
improved upper bound of δ.


Future work will also be needed to investigate the
conditions under which pruning is useful in more complex
situations, such as larger decision trees having more than one
node and nonbinary trees.


REFERENCES 


Abramowitz, M. and Stegun, I. (1968), Handbook of Mathematical
Functions. Dover Publications, New York.

Angluin, D. and Laird, P. (1988), "Learning From Noisy
Examples," Machine Learning. 2, 343-370.

Angluin, D. and Smith, C.H. (1983), "Inductive Inference:
Theory and Methods," Computing Surveys. 15, 237-269.

Blumer, A., Ehrenfeucht, A., Haussler, D. and Warmuth, M.
(1987a), "Occam's Razor," Information Processing Letters.
24, 377-380.

Blumer, A., Ehrenfeucht, A., Haussler, D. and Warmuth, M.
(1987b), "Learnability and the Vapnik-Chervonenkis
Dimension," Technical Report UCSC-CRL-87-20, University
of California, Santa Cruz, CA.

Boose, J. H. and Gaines, B.R. (1989), "Knowledge Acquisition
for Knowledge-Based Systems: Notes on the
State-of-the-Art," Machine Learning. 4, 377-394.

Boucheron, S. and Sallantin, J. (1988), "Learning in the
Presence of Noise," Proceedings of the Third European
Working Session on Learning. London, 25-35.

Braun, H. and Chandler, J.S. (1987), "Predicting Stock Market
Behavior Through Rule Induction: An Application of the
Learning-from-Example Approach," Decision Sciences. 18,
415-429.

Breiman, L., Friedman, J., Olshen, R. and Stone, C. (1984),
Classification and Regression Trees. San Francisco:
Wadsworth International.

Buntine, W. (1989), "Learning Classification Rules using
Bayes," Proceedings of the 6th International Conference
on Machine Learning (pp. 146-150), San Mateo, CA: Morgan
Kaufmann.

Carter, C. and Catlett, J. (1987), "Assessing Credit Card
Applications Using Machine Learning," IEEE Expert. 2,
71-79.




Chan, P.K. (1989), "Inductive Learning with BCT," Proceedings
of the 6th International Workshop on Machine Learning
(pp. 104-108), San Mateo, CA: Morgan Kaufmann.

Chan, K. and Wong, A. (1990), "Performance Analysis of a
Probabilistic Inductive Learning System," Proceedings of
the 7th International Conference on Machine Learning
(pp. 16-23), San Mateo, CA: Morgan Kaufmann.

Cheng, J., Fayyad, U.M., Irani, K.B. and Qian, Z. (1988),
"Improved Decision Trees: A Generalized Version of ID3,"
Proceedings of the 5th International Conference on
Machine Learning (pp. 100-106), San Mateo, CA: Morgan
Kaufmann.

Chou, P.A. (1991), "Optimal Partitioning for Classification
and Regression Trees," IEEE Transactions on Pattern
Analysis and Machine Intelligence. 13, 4, 340-354.

Cohen, P. R. and Feigenbaum, E.A. (1982), The Handbook of
Artificial Intelligence. Vol III. Reading, MA:
Addison-Wesley.

Clark, P. and Niblett, T. (1987), "Induction in Noisy
Domains," Proceedings of the Second European Working
Session on Learning (pp. 11-30), Bled, Yugoslavia: Sigma
Press.

Clark, P. and Niblett, T. (1989), "The CN2 Induction
Algorithm," Machine Learning. 3, 261-283.

Davis, R. (1982), "TEIRESIAS: Applications of Meta-Knowledge,"
In Knowledge Based Systems in Artificial Intelligence
(pp. 227-408), R. Davis and D. Lenat (Eds.), New York:
McGraw-Hill.

Ehrenfeucht, A. and Haussler, D. (1988), "Learning Decision
Trees From Random Examples," Proceedings of the 1988
Workshop on Computational Learning Theory (pp. 182-194),
San Mateo, CA: Morgan Kaufmann.

Ehrenfeucht, A., Haussler, D., Kearns, M. and Valiant, L.
(1989), "A General Lower Bound on the Number of Examples
Needed for Learning," Information and Computation. 82, 3,
247-261.

Fisher, D. H. and Schlimmer, J. C. (1988), "Concept
Simplification and Prediction Accuracy," Proceedings of
the 5th International Conference on Machine Learning
(pp. 22-28), San Mateo, CA: Morgan Kaufmann.


Garson, B. (1988), The Electronic Sweatshop: How Computers Are
Transforming the Office of the Future Into the Factory of
the Past. New York: Simon and Schuster.

Gelfand, S.B., Ravishankar, C.S., and Delp, E.J. (1991), "An
Iterative Growing and Pruning Algorithm for
Classification Tree Design," IEEE Transactions on Pattern
Analysis and Machine Intelligence. 13, 2, 163-174.

Gould, H.W. (1972), Combinatorial Identities. Morgantown, WV:
Morgantown Printing and Binding Co.

Haussler, D. (1988), "Quantifying Inductive Bias: AI Learning
Algorithms and Valiant's Learning Framework," Artificial
Intelligence. 36, 177-221.

Haussler, D. (1989), "Learning Conjunctive Concepts in
Structural Domains," Machine Learning. 4, 7-40.

Haussler, D. (1990), "Decision Theoretic Generalization of the
PAC Model for Neural Net and Other Learning
Applications," Technical Report, UCSC-CRL-91-02,
University of California, Santa Cruz.

Hirsh, H. (1990), "Learning from Data with Bounded
Inconsistency," Proceedings of the 7th International
Conference on Machine Learning (pp. 32-39), San Mateo, CA:
Morgan Kaufmann.

Johnson, D.E. (1983), "What Kind of Expert Should a System
Be?," Journal of Medicine and Philosophy. 8, 77-97.

Johnson, N. and Kotz, S. (1969), Discrete Distributions.
Boston: Houghton Mifflin Co.

Koehler, G.J. and Majthay, A. (1988), "Generalization of
Quinlan's Induction Method," Department of Decision and
Information Sciences, University of Florida, Unpublished
Manuscript.

Laird, P. (1987), Learning From Good Data and Bad. Doctoral
Dissertation, Department of Computer Science, Yale
University, New Haven, CT.

Landell, L. (1990), "Induction as Optimization," IEEE
Transactions on Systems, Man, and Cybernetics. 20, 2,
326-338.

Marshall, R. (1986), "Partitioning Methods for Classification
and Decision Making in Medicine," Statistics in Medicine.
5, 517-526.


Messier,  W.F.  and  Hansen,  J.V.  (1988),  "Inducing  Rules  for 
Expert  Systems  Development,"  Management  Science.  34,  12, 
1403-1415. 

Michalski, R. S. (1983), "A Theory and Methodology of
Inductive Learning," Artificial Intelligence. 20,
111-161.

Michalski, R.S. and Chilausky, C. (1980), "Learning by Being
Told and Learning From Examples: An Experimental
Comparison of the Two Methods of Knowledge Acquisition
in the Context of Developing an Expert System for Soybean
Disease Diagnosis," International Journal of Policy
Analysis and Information Systems. 4, 125-161.

Mingers,  J.  (1986),  "Expert  Systems — Experiments  with  Rule 
Induction,"  Journal  of  the  Operational  Research  Society. 
37,  1031-1037. 

Mingers,  J.  (1987),  "Expert  Systems — Rule  Induction  with 
Statistical  Data,"  Journal  of  the  Operational  Research 
Society.  38,  39-47. 

Mingers,  J.  (1989a),  "An  Empirical  Comparison  of  Selection 
Measures  for  Decision  Tree  Induction,"  Machine  Learning. 
3,  319-342. 

Mingers, J. (1989b), "An Empirical Comparison of Pruning
Methods for Decision Tree Induction," Machine Learning. 4,
227-243.

Mitchell, T.M. (1982), "Generalization as Search,"
Artificial Intelligence. 18, 203-226.

Musen, M. A. (1989), "Automated Support for Building and
Extending Expert Models," Machine Learning. 4, 347-375.

Natarajan, B. K. (1991), Machine Learning: A Theoretical
Approach, San Mateo, CA: Morgan Kaufmann.

Niblett, T. and Bratko, I. (1986), "Learning Decision Rules in
Noisy Domains," In M.A. Bramer (Ed.), Research and
Development in Expert Systems III (pp. 25-34), Cambridge:
Cambridge University Press.

Niblett, T. (1987), "Constructing Decision Trees in Noisy
Domains," Proceedings of the Second European Working
Session on Learning (pp. 67-78), Bled, Yugoslavia: Sigma
Press.

Nunez,  M.  (1991),  "The  Use  of  Background  Knowledge  in  Decision 
Tree  Induction,"  Machine  Learning.  6,  231-250. 


Parker,  C.S.  (1989),  Management  Information  Systems:  Strategy 
and  Action.  New  York:  McGraw-Hill. 

Peizer, D.B. and Pratt, J.W. (1968), "A Normal Approximation
for Binomial, F, Beta and other Common Related Tail
Probabilities I," Journal of the American Statistical
Association. 63, 1416-1456.

Quinlan, R. (1979), "Discovering Rules from Large Collection
of Examples: A Case Study," In D. Michie (Ed.), Expert
Systems in the Microelectronic Age (pp. 168-201).
Edinburgh: Edinburgh University Press.

Quinlan, R. (1983), "The Effect of Noise in Concept Learning,"
In R.S. Michalski, J. Carbonell, T. Mitchell (Eds.),
Machine Learning: An Artificial Intelligence Approach.
Vol. II, Chapter 6, Los Altos, CA: Morgan Kaufmann.

Quinlan, R. (1986), "Induction of Decision Trees," Machine
Learning. 1, 86-106.

Quinlan, R. (1987a), "Simplifying Decision Trees,"
International Journal of Man-Machine Studies. 27,
221-234.

Quinlan, R. (1987b), "Generating Production Rules from
Decision Trees," Proceedings of the 10th International
Joint Conference on Artificial Intelligence
(pp. 304-307), Los Altos, CA: Morgan Kaufmann.

Quinlan, R. (1990), "Decision Trees and Decisionmaking," IEEE
Transactions on Systems, Man, and Cybernetics. 20, 2,
339-346.

Quinlan, R. and Rivest, R.L. (1989), "Inferring Decision Trees
Using the Minimum Description Length Principle,"
Information and Computation. 80, 227-248.

Raiffa, H. and Schlaifer, R. (1961), Applied Statistical
Decision Theory, Cambridge, MA: Division of Research,
Harvard Business School.

Rivest, R. (1987), "Learning Decision Lists," Machine
Learning. 2(3), 229-246.

Schaffer, C. (1991), "When does Overfitting Decrease
Prediction Accuracy in Induced Decision Trees and Rule
Sets?" Lecture Notes in Artificial Intelligence: Machine
Learning-EWSL-91. Porto, Portugal: Springer-Verlag.

Shackelford, G. and Volper, D. (1988), "Learning k-DNF with
Noise in the Attributes," Proceedings of the 1988
Workshop on Computational Learning Theory (pp. 97-103),
San Mateo, CA: Morgan Kaufmann.

Shaw,  M.  J.  and  Gentry,  J.A.  (1988),  "Using  an  Expert  System 
with  Inductive  Learning  to  Evaluate  Business  Loans," 
Financial  Management.  17,  45-56. 

Shaw,  M.  J.  and  Gentry,  J.A.  (1990),  "Inductive  Learning  for 
Risk  Classification,"  IEEE  Expert.  5,  47-53. 

Shaw, M. J., Gentry, J.A. and Piramuthu, S. (1990), "Inductive
Learning Methods for Knowledge-Based Decision Support: A
Comparative Analysis," Computer Science in Economics and
Management. 3, 147-165.

Simon, H. (1983), "Why Should Machines Learn?" In R.S.
Michalski, J. Carbonell, T. Mitchell (Eds.), Machine
Learning: An Artificial Intelligence Approach. Vol. I
(pp. 25-37), Palo Alto, CA: Tioga.

Simon, H. U. (1990), "On the Number of Examples and Stages
Needed for Learning Decision Trees," Proceedings of the
Third Annual Workshop on Computational Learning Theory
(pp. 303-313), San Mateo, CA: Morgan Kaufmann.

Spangler, S., Fayyad, U.M. and Uthurusamy, R. (1989),
"Induction of Decision Trees from Inconclusive Data,"
Proceedings of the 6th International Conference on
Machine Learning (pp. 146-150), San Mateo, CA: Morgan
Kaufmann.

Tsai, L. and Koehler, G.J. (in Press), "The Accuracy of
Concepts Learned from Induction," Decision Support
Systems, forthcoming.

Utgoff, P. (1989), "Incremental Induction of Decision Trees,"
Machine Learning. 4, 161-186.

Utgoff, P. and Brodley, C. (1990), "An Incremental Method for
Finding Multivariate Splits for Decision Trees,"
Proceedings of the 7th International Conference on
Machine Learning (pp. 58-65), San Mateo, CA: Morgan
Kaufmann.

Valiant, L.G. (1984), "A Theory of the Learnable,"
Communications of the ACM. 27(11), 1134-1142.

Valiant, L.G. (1985), "Learning Disjunctions of Conjunctions,"
Proceedings of the 9th International Joint Conference on
Artificial Intelligence (Vol. 1, pp. 560-566), Los
Angeles, CA: Morgan Kaufmann.


Van de Velde, W. (1990), "Incremental Induction of
Topologically Minimal Decision Trees," Proceedings of the
7th International Conference on Machine Learning
(pp. 66-74), San Mateo, CA: Morgan Kaufmann.

Vapnik, V.N. (1982), Estimation of Dependencies Based on
Empirical Data. New York: Springer-Verlag.

Weiss, S. M., Galen, R.S. and Tadepalli, P.V. (1987),
"Optimizing the Predictive Value of Diagnostic Decision
Rules," Proceedings of the Sixth National Conference on
Artificial Intelligence (pp. 521-526), San Mateo, CA:
Morgan Kaufmann.

Weiss, S. M. and Kulikowski, C. A. (1991), Computer Systems
that Learn, Los Altos, CA: Morgan Kaufmann.

Winkler, R. L. (1972), Introduction to Bayesian Inference and
Decision, New York: Holt, Rinehart and Winston.

Wirth, J. (1988), "Experiments on the Costs and Benefits of
Windowing in ID3," Proceedings of the 5th International
Conference on Machine Learning (pp. 87-99), San Mateo, CA:
Morgan Kaufmann.

Zhou, X.J. and Dillon, T.S. (1991), "A Statistical-Heuristic
Feature Selection Criterion for Decision Tree Induction,"
IEEE Transactions on Pattern Analysis and Machine
Intelligence. 13, 8, 834-841.


BIOGRAPHICAL  SKETCH 


Hyunsoo  Kim  was  born  on  August  6,  1958  in  Geumreung, 
Korea,  and  moved  to  Seoul,  Korea,  in  1967,  where  he  lived 
until  he  came  to  the  United  States  for  advanced  study  in  1989. 
He graduated from Munsung elementary school, Kangnam middle school,
and  Baemoon  high  school  in  1971,  1974,  and  1977,  respectively. 
In  March  of  1977  he  entered  Seoul  National  University  and 
graduated  in  February  1982  with  a B.S.  degree  in  nuclear 
engineering. 

With  a strong  motivation  to  build  a career  in  business 
and  management  he  joined  the  Management  Science  graduate 
program  in  Korea  Advanced  Institute  of  Science  and  Technology 
(KAIST)  in  March  1983,  and  received  his  master's  degree  in 
Management  Science  with  a management  information  systems  (MIS) 
concentration  in  February  1985.  From  March  1985  to  May  1988 
he  was  with  Data  Communications  Corp.  of  Korea,  Seoul,  Korea, 
where he completed several pioneering research projects in
MIS  and  software  engineering  including  "Software  Development 
and  Maintenance  Cost  Estimation  in  Korea."  From  June  1988  to 
August  1989  he  was  with  the  Information  Culture  Center,  Seoul, 
Korea,  where  he  taught  many  courses  in  MIS. 




In  August  1989  he  joined  the  Ph.D.  program  at  the 
University  of  Florida.  He  has  been  involved  in  various 
research  projects  in  expert  systems  and  machine  learning.  He 
has  written  four  papers  for  publication  as  a part  of  his 
dissertation. 


I certify  that  I have  read  this  study  and  that  in  my 
opinion  it  conforms  to  acceptable  standards  of  scholarly 
presentation  and  is  fully  adequate,  in  scope  and  quality,  as 
a dissertation  for  the  degree  of  Doctor  of  Philosophy. 


Gary J. Koehler, Chair

Professor of Decision and Information
Sciences


I certify  that  I have  read  this  study  and  that  in  my 
opinion  it  conforms  to  acceptable  standards  of  scholarly 
presentation  and  is  fully  adequate,  in  scope  and  quality,  as 
a dissertation  for  the  degree  of  Doctor  of  Philosophy. 

Harold  P.  Benson 

Professor  of  Decision  and  Information 
Sciences 


I certify  that  I have  read  this  study  and  that  in  my 
opinion  it  conforms  to  acceptable  standards  of  scholarly 
presentation  and  is  fully  adequate,  in  scope  and  quality,  as 
a dissertation  for  the  degree  of  Doctor  of  Philosophy. 


Mark  Pendergast 

Assistant Professor of Decision and
Information Sciences


I certify  that  I have  read  this  study  and  that  in  my 
opinion  it  conforms  to  acceptable  standards  of  scholarly 
presentation  and  is  fully  adequate,  in  scope  and  quality,  as 
a dissertation  for  the  degree  of  Doctor  of  Philosophy. 


This  dissertation  was  submitted  to  the  Graduate  Faculty 
of  the  Department  of  Decision  and  Information  Sciences  in  the 
College  of  Business  Administration  and  to  the  Graduate  School 
and  was  accepted  as  partial  fulfillment  of  the  requirements 
for  the  degree  of  Doctor  of  Philosophy. 


May  1992  

Dean,  Graduate  School 


